Learn Advanced: Interpreting Complex Plans | Real-World Query Optimization Scenarios

Swipe to show menu

When you work with complex SQL queries involving multiple joins, nested subqueries, and derived tables, the EXPLAIN output can appear intimidating at first glance. Understanding how to break down and interpret these advanced query plans is crucial for effective optimization. Complex plans often arise when you need to combine data from several related tables, filter results using subqueries, or generate temporary result sets (derived tables) to simplify logic or improve performance.

A derived table is a subquery in the FROM clause that acts like a temporary table for the duration of the query. Derived tables help encapsulate logic such as aggregations or filtering that you want to reuse or join with other tables. In EXPLAIN output, a derived table usually appears as a separate step, often labeled as "Subquery" or "Derived", and you can see how the planner executes it before joining it with other tables.

To see how this works, consider a query that retrieves each driver's name, the number of completed rides they have given, and their average rating, but only for drivers with more than one completed ride. This query uses a derived table to first aggregate ride counts and average ratings per driver, then joins it back to the drivers table:


              123456789101112131415161718
            
EXPLAIN
SELECT d.first_name, d.last_name, stats.ride_count, stats.avg_rating
FROM drivers d
JOIN (
    SELECT r.driver_id, COUNT(*) AS ride_count, AVG(rt.rating) AS avg_rating
    FROM rides r
    JOIN ratings rt ON r.ride_id = rt.ride_id
    WHERE r.status = 'completed'
    GROUP BY r.driver_id
    HAVING COUNT(*) > 1
) AS stats ON d.driver_id = stats.driver_id;
-- Example EXPLAIN output (format may vary by SQL dialect):
-- | id | select_type | table   | type   | possible_keys | key     | rows | Extra                                      |
-- |----|-------------|---------|--------|--------------|---------|------|---------------------------------------------|
-- |  1 | PRIMARY     | <derived2> | ALL  | NULL         | NULL    |    2 |                                             |
-- |  1 | PRIMARY     | d       | eq_ref | PRIMARY      | PRIMARY |    1 |                                             |
-- |  2 | DERIVED     | r       | ref    | driver_id    | driver_id |   3 | Using where; Using temporary; Using filesort|
-- |  2 | DERIVED     | rt      | ref    | ride_id      | ride_id |   2 | Using where                                 |

To interpret this plan, start from the innermost operations. The derived table (labeled as DERIVED or Subquery) groups rides by driver_id and calculates both the ride count and average rating. Notice that the planner shows the join between rides and ratings inside the subquery, using the ride_id key. The use of "Using temporary" and "Using filesort" in the Extra column indicates that the database engine will create a temporary table and may sort data to compute the GROUP BY and HAVING clauses.

After the derived table is built, the main query joins it to the drivers table using the primary key. The EXPLAIN output shows this join as a step with eq_ref, meaning the join uses a unique index lookup, which is efficient.

Breaking down a complex plan like this involves:

Identifying subqueries and derived tables and understanding what result sets they produce;
Tracing how each join is performed and which indexes are used;
Noting any steps that require temporary tables or filesorts, as these can be performance bottlenecks;
Observing the row estimates for each step to see where the most data is processed.

Suppose you want to optimize this query further. You might decide to add an index on ratings.ride_id or rewrite the query to avoid unnecessary filesorts. Compare the EXPLAIN output before and after optimization to see the impact.


              1234567891011121314151617181920212223242526272829
            
-- Before optimization (no additional indexes)
EXPLAIN
SELECT d.first_name, d.last_name, stats.ride_count, stats.avg_rating
FROM drivers d
JOIN (
    SELECT r.driver_id, COUNT(*) AS ride_count, AVG(rt.rating) AS avg_rating
    FROM rides r
    JOIN ratings rt ON r.ride_id = rt.ride_id
    WHERE r.status = 'completed'
    GROUP BY r.driver_id
    HAVING COUNT(*) > 1
) AS stats ON d.driver_id = stats.driver_id;

-- After optimization (add index on ratings.ride_id)
CREATE INDEX idx_ratings_ride_id ON ratings(ride_id);

EXPLAIN
SELECT d.first_name, d.last_name, stats.ride_count, stats.avg_rating
FROM drivers d
JOIN (
    SELECT r.driver_id, COUNT(*) AS ride_count, AVG(rt.rating) AS avg_rating
    FROM rides r
    JOIN ratings rt ON r.ride_id = rt.ride_id
    WHERE r.status = 'completed'
    GROUP BY r.driver_id
    HAVING COUNT(*) > 1
) AS stats ON d.driver_id = stats.driver_id;

-- Notice that after adding the index, the EXPLAIN output may show "Using index" for the join, and the number of rows examined in the ratings table may decrease, indicating a more efficient plan.

By comparing the two EXPLAIN outputs, you can see how the addition of an index can reduce the need for temporary tables or filesorts and speed up the join between rides and ratings. When faced with a complex plan, always look for steps with high row counts or expensive operations like filesorts, and consider whether indexes or query rewrites can help.

To check your understanding of these concepts, consider the following questions.

1. What is a derived table, and how does it appear in EXPLAIN output?

2. How can you simplify a complex query plan for easier optimization?

Everything was clear?

Thanks for your feedback!

Section 3. Chapter 5

Ask AI

Ask anything or try one of the suggested questions to begin our chat

Section 3. Chapter 5