Database Optimization: Troubleshoot Slow Database Query

The Java Trail
15 min read · Dec 2, 2023

Embarking on the journey of database optimization is like unlocking the hidden potential of your application. In this exploration, we’ll navigate through key strategies to fine-tune your database for lightning-fast performance.

First, we’ll unravel the magic of indexing — the art of organizing data for quick retrieval. From B-trees to hash indexes, each technique has its unique charm, tailored to specific data scenarios.

Next, we delve into the realm of query optimization, rewriting commands, and avoiding common pitfalls. Filtering data early, limiting results, and optimizing JOIN operations set the stage for efficient query execution.

Caching takes center stage, with techniques to store and retrieve frequently accessed data, enhancing response times. However, we’ll also tackle the challenges of cache invalidation and coherence for synchronized data.

Strategies to Optimize the Performance of a Slow Database Query:

1. Indexing:

Indexing involves creating a data structure (Balanced Tree/ HashTable) that allows the database management system to quickly locate and retrieve rows from a table.

B-tree Index: The default index type in most relational databases. Suitable for high cardinality data (many distinct values), supporting equality and range queries and offering balanced search and insertion operations.

In an orders table, a B-tree index on the “order_date” column would efficiently support queries that retrieve all orders placed within a given date range.

Bitmap Index: Effective for low cardinality columns (few distinct values), using bitmap vectors for presence or absence representation. Ideal for multiple condition queries with AND or OR operators.

In a customer database where the “status” column has only a few distinct values like ‘active’ or ‘inactive’, a bitmap index on “status” can be combined cheaply with bitmaps on other columns to retrieve, for example, all active customers in a given region.

Hash Index: Best for point lookups on columns with high cardinality.

In a user authentication system, the “user_id” column might have high cardinality. Using a hash index on the “user_id” column allows for fast point lookups when verifying user credentials during login.
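
To make the difference between a hash index and an ordered (B-tree-style) index concrete, the following minimal sketch models both with Java collections. It is an in-memory analogy, not how a database stores indexes; the class, method, and column names (IndexSketch, signupDay, and so on) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of two index types over a "users" table.
class IndexSketch {
    // Hash-style index: user_id -> row position. Supports equality lookups only.
    private final Map<String, Integer> hashIndex = new HashMap<>();
    // Ordered (B-tree-style) index: signup day -> row positions. Supports range scans.
    private final TreeMap<Integer, List<Integer>> orderedIndex = new TreeMap<>();

    void insert(int rowPosition, String userId, int signupDay) {
        hashIndex.put(userId, rowPosition);
        orderedIndex.computeIfAbsent(signupDay, day -> new ArrayList<>()).add(rowPosition);
    }

    // Models: WHERE user_id = ?
    Integer findByUserId(String userId) {
        return hashIndex.get(userId);
    }

    // Models: WHERE signup_day BETWEEN ? AND ?
    List<Integer> findBySignupDayBetween(int fromDay, int toDay) {
        List<Integer> rows = new ArrayList<>();
        for (List<Integer> positions : orderedIndex.subMap(fromDay, true, toDay, true).values()) {
            rows.addAll(positions);
        }
        return rows;
    }
}
```

The hash map answers a point lookup in one step but has no notion of order, while the tree map can return a contiguous slice of keys, which is why range predicates need an ordered index.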

GiST/GIN Index: PostgreSQL index types suitable for complex values such as JSON documents, arrays, or key-value maps, especially when queries look inside the stored value and its internal structure is known.

Trigram Index: Splits text into overlapping three-character sequences (trigrams) and indexes them, improving performance of the LIKE operator with leading or embedded wildcards, which an ordinary B-tree index cannot serve.

In a text search application where users look for a phrase anywhere inside a text column, a trigram index on that column can enhance the performance of queries such as LIKE ‘%data analysis%’ over a large dataset.
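
The idea behind trigram matching can be sketched in a few lines. This is a simplified illustration only: real implementations such as PostgreSQL's pg_trgm also pad and normalize words, which is omitted here, and the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Simplified trigram extraction: lower-cased, no word padding.
class TrigramSketch {
    // Splits text into its overlapping three-character sequences.
    static Set<String> trigrams(String text) {
        String normalized = text.toLowerCase();
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            result.add(normalized.substring(i, i + 3));
        }
        return result;
    }

    // A row can only match LIKE '%pattern%' if it contains every trigram of the pattern.
    // Rows that pass this filter are candidates and still need an exact check.
    static boolean mayContain(String rowText, String pattern) {
        return trigrams(rowText).containsAll(trigrams(pattern));
    }
}
```

An index built on these trigrams lets the engine discard most rows without reading their text, then recheck the few remaining candidates.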

-- Creating an index on the 'username' column
CREATE INDEX idx_username ON users(username);

Common techniques for index identification:

Identify Frequently Accessed Columns for WHERE/JOIN/ORDER BY: Analyze frequently executed queries to identify the columns they filter, join, or sort on, and create indexes on those columns for performance improvement.

In an e-commerce platform, analyzing frequently executed search queries to identify columns such as “product_name” and “category” can lead to the creation of indexes on these columns, improving search performance.

High-Cardinality Columns: Identifying columns with high cardinality is another common technique for determining index candidates. High cardinality refers to columns with a large number of distinct values relative to the total number of rows in the table. Indexing such columns can significantly improve query performance, as it allows the database engine to quickly locate specific rows based on the indexed values.

Composite Indexes: For queries involving multiple WHERE clause columns, consider creating composite indexes covering all used columns to reduce the number of indexes and optimize query performance.

In a sales database, where reports frequently filter data based on both “region” and “sales_date,” creating a composite index on these columns can significantly optimize queries seeking sales information for a specific region and date range.
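
As a rough illustration of why a composite index on (region, sales_date) serves this kind of query well, the sketch below models the index as a map ordered first by region and then by day. The names are hypothetical and days are plain integers for brevity.

```java
import java.util.TreeMap;

// Hypothetical model of a composite index on (region, sale day).
class CompositeIndexSketch {
    // region -> (sale day -> total amount): ordered by region first, then by day.
    private final TreeMap<String, TreeMap<Integer, Double>> index = new TreeMap<>();

    void add(String region, int saleDay, double amount) {
        index.computeIfAbsent(region, r -> new TreeMap<>()).merge(saleDay, amount, Double::sum);
    }

    // Models: WHERE region = ? AND sale_day BETWEEN ? AND ?
    // The matching entries form one contiguous slice, so nothing else is visited.
    double totalFor(String region, int fromDay, int toDay) {
        TreeMap<Integer, Double> byDay = index.get(region);
        if (byDay == null) {
            return 0.0;
        }
        double total = 0.0;
        for (double amount : byDay.subMap(fromDay, true, toDay, true).values()) {
            total += amount;
        }
        return total;
    }
}
```

The same structure shows why column order matters: a lookup by day alone would have to visit every region, just as a composite index is of little help to queries that skip its leading column.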

Avoid Over-Indexing: Creating excessive indexes can increase overhead during data modifications (inserts, updates, and deletes). Striking a balance is crucial for optimal database performance.

In a logging system capturing high-frequency events, creating indexes on every column may lead to increased overhead during data insertion. Selectively indexing critical columns, like timestamps or unique identifiers, helps maintain a balance between query performance and data modification efficiency.

Database indexing, while a powerful performance booster, comes with both pros and cons to consider. Let’s delve into these:

Pros:

  • Faster search and retrieval: Indexes allow queries to quickly locate relevant data without scanning the entire table. This significantly improves performance for queries focusing on indexed columns.
  • Improved sorting and grouping: Sorting and grouping operations based on indexed columns also benefit from their organization, leading to faster execution.
  • Data integrity and validation: Certain indexes (unique and primary) enforce data integrity by disallowing duplicate entries, and they give foreign keys a unique target to reference.

Cons:

  • Increased storage overhead: Creating and maintaining indexes consumes additional storage space. The more complex the index, the greater the space requirement.
  • Slower data modification: Inserting, updating, or deleting data becomes slightly slower when indexes are involved, as the index itself needs to be updated alongside the actual data.

Considerations when deciding:

  • Frequency of data modification: If your data is frequently modified, the performance penalty of updating indexes might outweigh the benefits of faster searches.
  • Query patterns: Consider which columns are most commonly used in queries and prioritize indexing those columns.
  • Available storage and resources: Ensure you have adequate storage space and processing power to maintain indexes without impacting overall performance.
  • Alternative search mechanisms: Consider alternatives like full-text search or materialized views depending on your specific needs.

2. Query Optimization:

Utilize Indexes: Ensure relevant columns in WHERE clauses and JOIN conditions have appropriate indexes.

Query Rewriting: Simplify complex queries by reducing joins or subqueries, aiding the database in finding more efficient execution plans.

-- Original query: joins every order to its customer, then filters by country
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id WHERE c.country = 'USA';

/* Rewritten query: when only order columns are needed, the IN subquery
filters customers by country without adding customer columns to the result.
Many optimizers produce the same plan for both forms, so compare them with EXPLAIN. */

SELECT * FROM orders o WHERE o.customer_id
IN (SELECT customer_id FROM customers WHERE country = 'USA');

Avoid SELECT * : Specify only necessary columns in the SELECT statement instead of retrieving all columns from a table.

-- Bad query
SELECT * FROM products WHERE category = 'Electronics';

-- Good query
SELECT product_id, product_name, price FROM products
WHERE category = 'Electronics';

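Early Data Filtering: Employ WHERE clauses to filter data early in the query execution process, enhancing efficiency. A highly selective condition (one that matches only a small fraction of the rows, typically on a column with many distinct values) significantly reduces the number of rows every later step has to consider. Most cost-based optimizers choose the evaluation order themselves regardless of how the conditions are written, so the practical step is to make sure the selective column is indexed.

SELECT * FROM sales WHERE sale_date BETWEEN '2023-01-01' AND '2023-12-31'
AND product_id = 101;

/* If "product_id" is more selective than "sale_date", an index on product_id
(or a composite index on product_id, sale_date) lets the engine discard most rows first.
On engines that do not reorder predicates, the reordered conditions below
may also result in faster query performance. */

SELECT * FROM sales WHERE product_id = 101 AND
sale_date BETWEEN '2023-01-01' AND '2023-12-31';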

Result Limiting: Using LIMIT in your queries is a good practice, especially when dealing with large datasets or when you only need a specific subset of the data. Pair it with ORDER BY so that each page is deterministic.

SELECT * FROM customers ORDER BY customer_id LIMIT 10; -- first page of a paginated result set

Use UNION ALL Instead of UNION: When combining results from multiple queries whose rows cannot overlap, or where duplicates are acceptable, consider using UNION ALL instead of UNION. UNION ALL skips the duplicate removal process, making it faster because it doesn’t incur the overhead of checking and eliminating duplicate rows.

SELECT employee_id, employee_name FROM permanent_employees
UNION ALL
SELECT employee_id, employee_name FROM contract_employees;
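
The extra work UNION performs can be pictured with two small helper methods. This is only an analogy in Java collections, with hypothetical names; a database may deduplicate by sorting or hashing, but either way it is work that UNION ALL avoids.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

class UnionSketch {
    // UNION ALL: simply appends one result set to the other.
    static List<String> unionAll(List<String> first, List<String> second) {
        List<String> combined = new ArrayList<>(first);
        combined.addAll(second);
        return combined;
    }

    // UNION: the same rows, plus a pass that tracks and drops duplicates.
    static List<String> union(List<String> first, List<String> second) {
        return new ArrayList<>(new LinkedHashSet<>(unionAll(first, second)));
    }
}
```

If an employee could appear in both tables and must be listed once, UNION remains the correct choice despite its cost.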

Optimize Subqueries: If using subqueries, ensure they are optimized and not adversely impacting performance. Correlated subqueries in particular may be evaluated once per outer row, leading to redundant operations. Rewriting them with JOIN clauses lets the database use set-based operations, resulting in more efficient processing. For a simple uncorrelated IN subquery like the one below, many optimizers already produce the same plan as the JOIN, so verify the gain with EXPLAIN.

-- Subquery version
SELECT employee_id, employee_name
FROM employees
WHERE department_id IN (SELECT department_id FROM departments
WHERE location = 'XYZ');

-- JOIN version (equivalent as long as department_id is unique in departments)
SELECT e.employee_id, e.employee_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id
WHERE d.location = 'XYZ';

3. Update and Delete in Batches:

For large update or delete operations, break them into smaller batches. This prevents long-running transactions and reduces the likelihood of locking issues.

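Example Query for Updating in Batches: This query updates the employee_salary table in key ranges of 1000 IDs at a time. Because updated rows still match the WHERE condition, the loop walks the key column instead of repeating the same UPDATE until no rows are affected, which would apply the raise again on every pass and never terminate. Each batch is a short statement of its own, preventing long-running transactions and minimizing the impact on the database. The syntax is SQL Server (T-SQL), and the employee_id key column is assumed for illustration.

-- Bad Query (Updating all rows in a single transaction)
UPDATE employee_salary
SET salary = salary * 1.1
WHERE department_id = 5;

/*Updating all rows in a single transaction can be resource-intensive and
might lead to locking issues, especially in a high-transaction environment.*/

-- Good Query (Updating in Batches)
-- Assumes an integer key column employee_id; each row is updated exactly once.
DECLARE @BatchSize INT = 1000;
DECLARE @LastId INT = 0;
DECLARE @MaxId INT = (SELECT MAX(employee_id) FROM employee_salary);

WHILE @LastId < @MaxId
BEGIN
UPDATE employee_salary
SET salary = salary * 1.1
WHERE department_id = 5
AND employee_id > @LastId AND employee_id <= @LastId + @BatchSize;

SET @LastId = @LastId + @BatchSize;
END;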

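Example Query for Deleting in Batches: Deleted rows no longer match the condition, so here a simple loop on the affected row count terminates on its own.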

-- Bad Query (Deleting all rows in a single transaction)
DELETE FROM employee_salary
WHERE termination_date < '2022-01-01';

-- Good Query (Deleting in Batches)
DECLARE @BatchSize INT = 1000;
DECLARE @RowsAffected INT = 1;

WHILE @RowsAffected > 0
BEGIN
DELETE TOP(@BatchSize) FROM employee_salary
WHERE termination_date < '2022-01-01';

SET @RowsAffected = @@ROWCOUNT;
END;

4. Query Result Caching: Hibernate+Redis Cache

Database caching is a powerful technique to accelerate response times and enhance application responsiveness. Various caching methods can be employed to optimize performance:

  • Application-Level Caching: Utilize in-memory caches within your application, or Memcached or Redis for distributed caching.
  • Database-Level Caching: Caching performed by the database itself, such as the PostgreSQL buffer cache or the MySQL query cache (removed in MySQL 8.0).

Two kinds of data are commonly cached:

  1. Query Result Caching: Store the outcomes of frequently executed queries in memory. Subsequent requests with identical queries can retrieve cached results instead of re-executing the query, reducing response times.
  2. Object Caching: Cache frequently accessed data objects, such as user profiles or product information, in memory. This minimizes the need to repeatedly fetch the same data from the database. In a stateful system, for example, user authentication tokens can be kept in memory to avoid querying the database for credentials on every call.

Challenges with Database Caching:

Invalidation (Eventual Consistency with the Datastore): Invalidation is the process of updating or removing cached data when the corresponding data in the database undergoes modifications. Ensuring that cached information remains current is essential to prevent serving outdated data.

  • Time-Based Invalidation: Regularly refresh or clear the cache based on predefined time intervals. Example: Refresh cache every hour.
  • Event-Based Invalidation: Trigger cache updates when specific events, such as data modifications, occur. Example: Invalidate cache when a user updates their profile.

Coherence: Cache coherence refers to maintaining consistency between the data stored in the cache and the data in the database. Various techniques, such as time-based or event-based invalidation, are employed to ensure synchronized and accurate data. This can be implemented with Redis pub/sub messaging or with Kafka and Debezium, where changes captured from the database are published and the cache subscribes to them.

  • Example: Implementing a mechanism that automatically updates the cache when changes occur in the database to avoid discrepancies
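
The two invalidation styles can be combined in a small in-process cache. The following is a minimal sketch under stated assumptions, not the Hibernate or Redis API: QueryResultCache, its methods, and the injected clock are hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical in-process cache with time-based expiry and explicit invalidation.
class QueryResultCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so expiry can be tested without sleeping

    QueryResultCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Returns the cached value, or runs the loader (the "database query") on a miss or expiry.
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.get(key);
        if (entry == null || entry.expiresAtMillis <= now) {
            entry = new Entry<>(loader.apply(key), now + ttlMillis);
            entries.put(key, entry);
        }
        return entry.value;
    }

    // Event-based invalidation: call this when the underlying row changes.
    void invalidate(K key) {
        entries.remove(key);
    }
}
```

In production the clock would typically be System::currentTimeMillis, and invalidate would be called from the code path (or change-event subscriber) that handles writes, such as a profile update.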

5. Table Partitioning (Single Database Machine):

CREATE TABLE sales (
sale_id INT PRIMARY KEY,
product_id INT,
sale_date DATE,
sale_amount DECIMAL(10, 2)
-- Additional columns...
);

-- Retrieve total sales amount for a specific date range
SELECT SUM(sale_amount)
FROM sales
WHERE sale_date BETWEEN '2023-01-01' AND '2023-12-31';

In this scenario, the ‘sales’ table could potentially contain millions or billions of rows, and running the query as it is might be slow due to the need to scan the entire table.

-- MySQL syntax: one partition per year of sale_date
CREATE TABLE sales_partitioned (
sale_id INT,
product_id INT,
sale_date DATE,
sale_amount DECIMAL(10, 2)
-- Additional columns...
) PARTITION BY RANGE (YEAR(sale_date)) (
PARTITION p0 VALUES LESS THAN (2022),
PARTITION p1 VALUES LESS THAN (2023),
PARTITION p2 VALUES LESS THAN (2024)
-- Add more partitions as needed...
);

Query on Partitioned Table:

-- Retrieve total sales amount for a specific date range
SELECT SUM(sale_amount)
FROM sales_partitioned
WHERE sale_date BETWEEN '2023-01-01' AND '2023-12-31';

When querying for sales within a specific date range, the database engine only needs to scan the partitions that contain relevant data (in this case, the partition for the year 2023). This significantly reduces the amount of data to be scanned compared to scanning the entire ‘sales’ table. The database engine can skip irrelevant partitions, leading to improved performance for date-range-specific queries.
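
Partition pruning can be pictured with a small in-memory model in which rows are grouped by year and a query only opens the groups its date range overlaps. This is a sketch of the idea rather than of any engine's implementation; all names are hypothetical.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model of a table partitioned by YEAR(sale_date).
class PartitionedSales {
    private static final class Sale {
        final LocalDate date;
        final double amount;

        Sale(LocalDate date, double amount) {
            this.date = date;
            this.amount = amount;
        }
    }

    private final TreeMap<Integer, List<Sale>> partitionsByYear = new TreeMap<>();
    private int partitionsScanned;

    void insert(LocalDate date, double amount) {
        partitionsByYear.computeIfAbsent(date.getYear(), year -> new ArrayList<>())
                .add(new Sale(date, amount));
    }

    // Models: SELECT SUM(sale_amount) ... WHERE sale_date BETWEEN from AND to
    double sumBetween(LocalDate from, LocalDate to) {
        partitionsScanned = 0;
        double total = 0.0;
        // Pruning: only partitions whose year overlaps the requested range are visited.
        for (List<Sale> partition
                : partitionsByYear.subMap(from.getYear(), true, to.getYear(), true).values()) {
            partitionsScanned++;
            for (Sale sale : partition) {
                if (!sale.date.isBefore(from) && !sale.date.isAfter(to)) {
                    total += sale.amount;
                }
            }
        }
        return total;
    }

    int partitionsScannedByLastQuery() {
        return partitionsScanned;
    }
}
```

With data for 2021 through 2023 loaded, a query for the year 2023 touches a single partition, whereas the unpartitioned table would be read in full.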

6. Using Proper Joins:

Let’s consider an example where we have two tables: orders and customers. The goal is to retrieve information about orders along with the corresponding customer names.

--Original Query
SELECT orders.order_id, orders.order_date, customers.customer_name
FROM orders, customers
WHERE orders.customer_id = customers.customer_id;

--Optimal Query
SELECT orders.order_id, orders.order_date, customers.customer_name
FROM orders
INNER JOIN customers ON orders.customer_id = customers.customer_id;
  • The original query uses the older comma (implicit) join syntax, where the join condition lives in the WHERE clause. Most modern optimizers execute it the same way as an explicit join, but if the condition is forgotten or mistyped the query silently becomes a Cartesian product, which is highly inefficient on large datasets.
  • The optimized query uses INNER JOIN explicitly, keeping the join condition (orders.customer_id = customers.customer_id) next to the tables it relates. This is more readable, harder to get wrong, and generally performs the same or better.

7. Database Normalization

Database normalization aims to enhance data integrity by minimizing duplication and inconsistencies. Each piece of data should be stored in only one place, ensuring a more efficient and reliable database structure.

Benefits of Normalization:

Data Integrity: Reducing data redundancy minimizes the risk of inconsistencies and errors in the database.

Smaller Tables: Normalization leads to smaller, more focused tables, improving query performance.

Improved Indexing: Normalized data allows indexes to be more effective, resulting in faster data retrieval and reduced disk I/O.

8. Scaling for Increased Write Load: Sharding

Scaling a system to handle increased write loads involves strategies such as sharding (Data Partitioning) and horizontal scaling. Sharding/Partitioning involves breaking down a large database into smaller, more manageable parts called shards, while horizontal scaling involves adding more servers to distribute the load. In MySQL, achieving this involves careful planning and execution.

Sharding/Partitioning:

Horizontal Scaling: Sharding enables horizontal scaling by distributing data across multiple servers (shards), allowing for growth without relying on a single, large server.

  • Shard Key: The field that determines how data is partitioned among shards (e.g., user ID, region, date).
  • Routing Server: Directs requests to the correct shard based on the shard key.
  • Configuration Server: Stores metadata about the sharded cluster, ensuring routing accuracy.
  • Replica Sets: Groups of servers within each shard that maintain identical data copies for high availability.

Example: E-commerce Platform

Scenario: Millions of products and orders, requiring efficient management.

Sharding Approach: Shard Key: Product ID

  • Shard 1 stores products 1–100,000.
  • Shard 2 stores products 100,001–200,000.
  • Shard 3 stores products 200,001 and beyond.

Query Examples:

  • Viewing Product Details: User requests information for product 45,678. The routing server, using metadata from the configuration server, directs the query to Shard 1 (where products 1–100,000 reside).
  • Placing an Order: User orders products 23,456 and 350,002. The routing server sends write operations to Shard 1 for product 23,456 and Shard 3 for product 350,002.

Considerations:

  • Balancing Shards: Careful shard key selection ensures even data distribution across shards for optimal performance.
  • Joins Across Shards: Queries involving data from multiple shards can be more complex and potentially less efficient.
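
The routing step in the example above can be sketched as a lookup over the lower bound of each shard's range. This is a minimal illustration with hypothetical names; a real routing server would load the ranges from the configuration server rather than hard-coding them.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical range-based router for the product shards described above.
class RangeShardRouter {
    // Lower bound of each product ID range -> shard name.
    private final TreeMap<Integer, String> lowerBoundToShard = new TreeMap<>();

    RangeShardRouter() {
        lowerBoundToShard.put(1, "shard-1");       // products 1-100,000
        lowerBoundToShard.put(100_001, "shard-2"); // products 100,001-200,000
        lowerBoundToShard.put(200_001, "shard-3"); // products 200,001 and beyond
    }

    // Finds the range with the greatest lower bound that does not exceed the product ID.
    String shardFor(int productId) {
        Map.Entry<Integer, String> range = lowerBoundToShard.floorEntry(productId);
        if (range == null) {
            throw new IllegalArgumentException("No shard covers product " + productId);
        }
        return range.getValue();
    }
}
```

An order for products 23,456 and 350,002 resolves to two different shards, which is exactly the multi-shard write described in the order example.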

=========================================================

Solving Sharding-Related Issues:

Problem 1: Balancing Shards

Scenario Example: Let’s consider a scenario where an e-commerce platform decides to shard its customer data based on the ‘user_id.’ Initially, the system is designed with three shards, each handling a specific range of user IDs:

  • Shard 1: Users with IDs 1–1000
  • Shard 2: Users with IDs 1001–2000
  • Shard 3: Users with IDs 2001 and above

Over time, due to user growth, the Shard 3 becomes significantly larger than the other shards, causing an imbalance. This imbalance can result in uneven distribution of the workload, leading to performance issues on Shard 3 and underutilization of Shards 1 and 2.
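
The imbalance is easy to reproduce. The sketch below counts how 9,000 sequential user IDs land on three shards under the range scheme above, and compares it with simple modulo placement, used here as a stand-in for the hash-based sharding that appears in the VSchema example later in this article. The class and method names are hypothetical.

```java
import java.util.Arrays;

class ShardBalanceSketch {
    // Range placement from the scenario: 1-1000, 1001-2000, 2001 and above.
    static int rangeShard(int userId) {
        if (userId <= 1000) {
            return 0;
        }
        if (userId <= 2000) {
            return 1;
        }
        return 2;
    }

    // Modulo placement: a simple stand-in for hashing the shard key.
    static int moduloShard(int userId, int shardCount) {
        return Math.floorMod(userId, shardCount);
    }

    static int[] countByRange(int maxUserId) {
        int[] counts = new int[3];
        for (int id = 1; id <= maxUserId; id++) {
            counts[rangeShard(id)]++;
        }
        return counts;
    }

    static int[] countByModulo(int maxUserId, int shardCount) {
        int[] counts = new int[shardCount];
        for (int id = 1; id <= maxUserId; id++) {
            counts[moduloShard(id, shardCount)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(countByRange(9000)));     // [1000, 1000, 7000]
        System.out.println(Arrays.toString(countByModulo(9000, 3))); // [3000, 3000, 3000]
    }
}
```

Hash-style placement spreads growth evenly, but it gives up cheap range scans over the key and makes adding a shard more disruptive, which is why range-sharded systems rely on the resharding techniques that follow.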

Solution 1: Resharding

  • Periodically monitor data distribution across shards.
  • When a significant imbalance is detected, initiate a resharding operation.
  • In this scenario, you may decide to split Shard 3 into two new shards (Shard 3 and Shard 4) to balance the load.
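# Example: ApplySchema rolls a schema change out to every shard of the keyspace.
# Note: this is not itself a reshard. In Vitess, splitting a shard is done through
# the Reshard workflow, whose exact flags depend on the Vitess version.
vtctlclient ApplySchema -sql='ALTER TABLE customer ADD COLUMN new_column INT' ecommerce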

Solution 2: Key Range Adjustments

  • Consider adjusting key ranges for new shards during resharding to achieve a more balanced distribution.
  • Use tools provided by your sharding solution to manage key range adjustments.
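# Example: Adjust key ranges during resharding
# (illustrative only: SplitClone belongs to the legacy vtworker-based resharding flow,
# and the exact command and flags depend on your Vitess version)
vtctlclient SplitClone -source_shard=0 -destination_shard=4 ecommerce/0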

Problem 2: Joins Across Shards

Scenario Example: Consider an e-commerce platform that shards its ‘orders’ table by ‘order_id’ and its ‘customers’ table by ‘customer_id’. An order and the customer who placed it can then live on different shards, so retrieving an order together with its customer details requires a join across shards. This scenario introduces complexity, as traditional SQL joins typically operate within a single database, and distributed joins across shards can be less efficient.

Solution 1: Denormalization

  • Denormalization involves duplicating certain information across shards to reduce the need for joins.
  • In this example, we store essential customer information in the ‘orders’ shard, eliminating the necessity of frequent joins with the ‘customer’ shard when querying orders.
-- Example: Denormalize customer information in the 'orders' shard
CREATE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10, 2),
customer_name VARCHAR(255),
customer_email VARCHAR(255),
PRIMARY KEY (order_id),
INDEX (customer_id)
);

Solution 2: Application-Level Join (fetch each shard's data in a separate query)

  • Perform multiple queries against different shards and consolidate the results in application code.
  • In this example, we fetch order information from the ‘orders’ shard and customer information from the ‘customer’ shard, then assemble the final result.
  • The trade-off is potentially increased latency due to multiple queries.
// Fetch order and customer information from different shards
public Order getOrderDetails(int orderId) {
Order order = orderRepository.findByOrderId(orderId); // Query 'orders' shard
Customer customer = customerRepository.findByCustomerId(order.getCustomerId()); // Query 'customer' shard
order.setCustomer(customer);
return order;
}

Solution 3: Sharding Proxy (Vitess) without any application change

  • Sharding proxies, like Vitess, can handle complex queries involving multiple shards by routing the query appropriately.
  • In this example, Vitess processes a join between the ‘orders’ and ‘customers’ shards based on the ‘customer_id.’
-- Example (Vitess): Query with join across shards
SELECT orders.*, customers.customer_name
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
WHERE orders.order_id = 123;
  • Centralized management of query routing and execution.
  • Enables complex queries without significant application changes.

=========================================================

Sharding MySQL using Vitess

The following steps outline how to shard MySQL using Vitess. Command names and flags change between Vitess releases, so treat the snippets as illustrative and check the documentation for the version you run:

1. Prerequisites:

  • MySQL Version: Ensure compatibility with Vitess (5.6, 5.7, or 8.0).
  • Environment: Set up dedicated servers for Vitess components (Vttablets, Vtgates, etc.).
  • Tools: Install Git, Go, and required dependencies for Vitess.

2. Install Vitess:

  • Download Vitess source code from GitHub.
  • Compile and install using Go build tools.
# Clone the Vitess repository
git clone https://github.com/vitessio/vitess.git
cd vitess

# Compile and install
make build

3. Topology Planning:

  • Define number of shards and replicas per shard based on workload and availability needs.
  • Consider future growth and potential for reshards.
# Example: Create a keyspace named 'ecommerce' with two shards
vtctlclient CreateKeyspace -sharding_column_name=customer_id -sharding_column_type=INT32 ecommerce
vtctlclient RebuildVSchemaGraph -cells=test
vtctlclient RebuildKeyspaceGraph -cells=test ecommerce

vtctlclient CreateShard -force ecommerce/-80
vtctlclient CreateShard -force ecommerce/80-

vtctlclient RebuildKeyspaceGraph -cells=test ecommerce
  • This code snippet creates a keyspace named ‘ecommerce,’ specifying ‘customer_id’ as the sharding column.
  • The RebuildVSchemaGraph and RebuildKeyspaceGraph commands update the serving VSchema and keyspace graph for the 'ecommerce' keyspace.
  • Two shards, ‘ecommerce/-80’ and ‘ecommerce/80-’, are created to cover the sharding range. Replicas are added in the next step, when tablets are assigned to each shard.
  • The final RebuildKeyspaceGraph command ensures the internal Vitess graph is updated with the new shard information.
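
The shard names are key ranges written in hexadecimal: ‘-80’ covers keyspace IDs whose leading byte is below 0x80, and ‘80-’ covers the rest. The sketch below shows only that range check, treating a keyspace ID as an unsigned 64-bit number. The class is hypothetical, and the hashing that turns a customer_id into a keyspace ID is done by Vitess and is not reproduced here.

```java
// Hypothetical illustration of how a keyspace ID maps to the two shards above.
class KeyRangeSketch {
    // 0x8000000000000000 is the boundary between the "-80" and "80-" key ranges.
    private static final long BOUNDARY = 0x8000000000000000L;

    // Treats the keyspace ID as an unsigned 64-bit value, as the hex ranges imply.
    static String shardFor(long keyspaceId) {
        return Long.compareUnsigned(keyspaceId, BOUNDARY) < 0 ? "-80" : "80-";
    }
}
```

Splitting further follows the same pattern: four shards would be named ‘-40’, ‘40-80’, ‘80-c0’, and ‘c0-’.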

4. Keyspace and Shard Creation:

  • Use vtctlclient tool to create keyspaces (databases) and shards.
  • Assign physical MySQL instances to shards as Vttablets.
  • Initialize replication for each shard.
# Example: Initialize the primary Vttablet for shard -80
# (newer Vitess releases name this command InitShardPrimary)
vtctlclient InitShardMaster -force ecommerce/-80 ecommerce-0000000100
vtctlclient RebuildKeyspaceGraph -cells=test ecommerce
  • This code initializes a Vttablet (Vitess tablet) as the master of shard ‘-80’ in the ‘ecommerce’ keyspace; the same is repeated for shard ‘80-’ with its own tablet.
  • The RebuildKeyspaceGraph command updates the internal Vitess graph with the newly added Vttablet.

5. Routing Rules Configuration:

  • Define routing rules using VSchema for mapping queries to shards based on shard key.
  • Configure Vtgate (routing layer) with these rules.
# Example: Configure VSchema for 'ecommerce' keyspace
vtctlclient ApplyVSchema -vschema_file=vschema.json ecommerce
  • This code applies a VSchema configuration to the ‘ecommerce’ keyspace, defining how queries should be routed based on the shard key.
  • The vschema.json file contains the VSchema configuration with rules for sharding.
{
"sharded": true,
"vindexes": {"hash": {"type": "hash"}},
"tables": {
"customer": {
"column_vindexes": [
{"column": "customer_id", "name": "hash"}
]
}
}
}

This example VSchema configuration indicates that the ‘customer’ table is sharded using a hash-based strategy on the ‘customer_id’ column.

6. (Optional) Data Migration:

  • If migrating existing data, use Vitess tools for efficient distribution across shards.
  • Plan for downtime or use online migration techniques if required.
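# Example: Use vstreamer to distribute data (illustrative invocation)
vstreamer --vschema_file=vschema.json --cell=test --keyspace=ecommerce

VStreamer is the Vitess component that streams row changes so that data can be distributed across shards according to the VSchema configuration (vschema.json). It runs as part of VReplication rather than as a standalone command, so treat the line above as illustrative; in practice, migrations are driven through workflows such as MoveTables and Reshard.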

7. Application Changes:

  • Update application code to connect to Vtgate instead of direct MySQL connections.
  • Handle sharding-aware queries if necessary (e.g., using Vitess client libraries).

8. Monitoring and Management:

  • Use Vitess monitoring tools (Vttablet metrics, etc.) for performance and health tracking.
  • Manage data distribution, handle failovers, and perform reshards as needed.

Application example (for step 7): modify your Java Spring application code to connect to Vtgate (Vitess gateway) instead of direct MySQL connections.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class VitessService {

    // Placeholders: the port must be the one on which Vtgate serves the MySQL protocol,
    // and a MySQL JDBC driver is assumed to be on the classpath.
    private static final String VITESS_GATEWAY_HOST = "localhost";
    private static final int VITESS_GATEWAY_PORT = 15001;
    private static final String KEYSPACE_NAME = "test_keyspace";
    private static final String JDBC_URL = "jdbc:mysql://"
            + VITESS_GATEWAY_HOST + ":" + VITESS_GATEWAY_PORT + "/" + KEYSPACE_NAME;

    public void executeShardingQuery(int customerId) throws SQLException {
        String query = "SELECT * FROM customer WHERE customer_id = ?";
        try (Connection connection = DriverManager.getConnection(JDBC_URL);
             PreparedStatement statement = connection.prepareStatement(query)) {
            // Vtgate uses the sharding key in the WHERE clause to route to one shard
            statement.setInt(1, customerId);
            try (ResultSet rows = statement.executeQuery()) {
                while (rows.next()) {
                    // Process the results as needed
                }
            }
        }
    }
}
  • This Java example connects to Vtgate through its MySQL-protocol listener using standard JDBC, so no Vitess-specific client classes are required.
  • The host, port, and keyspace constants are placeholders to adjust for your deployment; in a Spring application the connection would normally come from a pooled DataSource rather than DriverManager.
  • The executeShardingQuery method demonstrates executing a sharding-aware query on the 'customer' table using the 'customer_id' as the sharding key.

Importance of Query Optimization and Indexing in Database Performance:

  1. Faster Data Retrieval: Efficient queries and proper indexing enable the database engine to quickly locate and retrieve specific data, reducing the time it takes to execute queries.
  2. Resource Utilization: Optimized queries consume fewer resources, ensuring efficient use of CPU, memory, and disk I/O. This helps in preventing resource contention and bottlenecks.
  3. Scalability: Well-optimized queries and appropriate indexing contribute to the scalability of a database, allowing it to handle increased workloads without a proportional decrease in performance.
