Why is my SQL query so slow?

I run the following query on a weekly basis, but it has reached the point where it now takes 22 hours to run! The purpose of the report is to aggregate impression and conversion data at the ad placement and date level, so the main table I am querying does not have a primary key, as there can be multiple events with the same date/placement.

The main data set has about 400K records, so it shouldn't take more than a few minutes to run this report.

The table descriptions are:

tbl_ads (400,000 records)

day_est     DATE (index)
conv_day_est    DATE (index)
placement_id    INT (index)
adunit_id   INT (index)
cost_type   VARCHAR(20)
cost_value  DECIMAL(10,2)
adserving_cost  DECIMAL(10,2)
conversion1 INT
estimated_spend DECIMAL(10,2)
clicks      INT
impressions INT
publisher_clicks    INT
publisher_impressions   INT
publisher_spend DECIMAL(10,2)
source VARCHAR(30)

map_external_id (75,000 records)

placement_id    INT
adunit_id   INT
external_id VARCHAR(50)
primary key(placement_id,adunit_id,external_id)

SQL Query

SELECT A.day_est, A.placement_id, A.placement_name, A.adunit_id, A.adunit_name,
    A.imp, A.clk, C.ads_cost, C.ads_spend, B.conversion1, B.conversion2,
    B.ID_Matched, C.pub_imps, C.pub_clicks, C.pub_spend,
    COALESCE(A.cost_type,B.cost_type) as cost_type,
    COALESCE(A.cost_value,B.cost_value) as cost_value, D.external_id
FROM (SELECT day_est, placement_id,adunit_id,placement_name,adunit_name,cost_type,cost_value,
    SUM(impressions) as imp, SUM(clicks) as clk
    FROM tbl_ads
    WHERE source='delivery'
    GROUP BY 1,2,3 ) as A LEFT JOIN
(
    SELECT conv_day_est, placement_id,adunit_id, cost_type,cost_value, SUM(conversion1) as conversion1,
    SUM(conversion2) as conversion2,SUM(id_match) as ID_Matched
    FROM tbl_ads
    WHERE source='attribution'
    GROUP BY 1,2,3
) as B on A.day_est=B.conv_day_est AND A.placement_id=B.placement_id AND A.adunit_id=B.adunit_id
LEFT JOIN
(
    SELECT day_est,placement_id,adunit_id,SUM(adserving_cost) as ads_cost, SUM(estimated_spend) as ads_spend,sum(publisher_clicks) as pub_clicks,sum(publisher_impressions) as pub_imps,sum(publisher_spend) as pub_spend
    FROM tbl_ads
    GROUP BY 1,2,3 ) as C on A.day_est=C.day_est AND A.placement_id=C.placement_id AND A.adunit_id=C.adunit_id
LEFT JOIN
(
    SELECT placement_id,adunit_id,external_id
    FROM map_external_id
) as D on A.placement_id=D.placement_id AND A.adunit_id=D.adunit_id
INTO OUTFILE '/tmp/weekly_report.csv';

Results of EXPLAIN:

+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
| id | select_type | table              | type  | possible_keys | key     | key_len | ref  | rows   | Extra          |
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
|  1 | PRIMARY     | <derived2>         | ALL   | NULL          | NULL    | NULL    | NULL | 136518 |                |
|  1 | PRIMARY     | <derived3>         | ALL   | NULL          | NULL    | NULL    | NULL |   5180 |                |
|  1 | PRIMARY     | <derived4>         | ALL   | NULL          | NULL    | NULL    | NULL | 198190 |                |
|  1 | PRIMARY     | <derived5>         | ALL   | NULL          | NULL    | NULL    | NULL |  23766 |                |
|  5 | DERIVED     | map_external_id    | index | NULL          | PRIMARY | 55      | NULL |  20797 | Using index    |
|  4 | DERIVED     | tbl_ads            | index | NULL          | PIndex  | 13      | NULL | 318400 |                |
|  3 | DERIVED     | tbl_ads            | ALL   | NULL          | NULL    | NULL    | NULL | 318400 | Using filesort |
|  2 | DERIVED     | tbl_ads            | index | NULL          | PIndex  | 13      | NULL | 318400 | Using where    |
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+

More of a speculative answer, but I don't think 22 hours is unrealistic.

First things first: you don't need the last subquery; just write

LEFT JOIN map_external_id as D on A.placement_id=D.placement_id AND A.adunit_id=D.adunit_id

Second, the first and second subqueries filter on the field source in their WHERE clauses, and that column is not marked as indexed in your schema. Does it have an index? I've had a table with about 1,000,000 entries where a missing index caused a processing time of 30 seconds for a simple query (I can't believe the guy who put that query in the login process).
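If it doesn't, something along these lines might help; the composite indexes are an assumption on my part, chosen so that each filtered subquery can satisfy both the WHERE and the GROUP BY from one index (index names are illustrative):

```sql
-- Assumed fix: composite indexes covering the source filter plus the
-- GROUP BY columns of the two filtered subqueries.
ALTER TABLE tbl_ads
    ADD INDEX idx_src_day  (source, day_est, placement_id, adunit_id),
    ADD INDEX idx_src_conv (source, conv_day_est, placement_id, adunit_id);
```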

A side question in between: what's the final result set size?

Third, my assumption is that when running the aggregating subqueries, MySQL actually creates temporary tables that do not have any indexes, which is bad. Have you looked at the result sets of the individual subqueries? What is the typical set size? From your statements and my guesses about your typical data, I would assume that the aggregation only marginally reduces the set size (apart from the WHERE filter). So let me guess, in order of the subqueries: 200,000, 100,000, 200,000.
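You can measure this directly instead of guessing, e.g. for the first subquery:

```sql
-- How many rows does the first aggregation actually produce?
SELECT COUNT(*)
FROM (
    SELECT day_est, placement_id, adunit_id
    FROM tbl_ads
    WHERE source = 'delivery'
    GROUP BY day_est, placement_id, adunit_id
) AS a;
```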

Each of the subqueries then joins with the next on three presumably unindexed fields. So the worst case for the first join is 200,000 * 100,000 = 20,000,000,000 comparisons. Extrapolating from my 30 seconds for a query on 1,000,000 records, that makes 20,000 * 30 = 600,000 sec, roughly 166 hours. Obviously that's way too much; maybe there's a digit missing, maybe it was 20 sec not 30, the result sets might differ, and the worst case is not the average case. But you get the picture.

My suggested approach would be to create additional tables that replace your aggregation subqueries. Judging from your queries, you could update them daily: since I guess you just insert rows for impressions etc., you can add the aggregation data incrementally. Then you transform your mega-query into two steps:

  1. update the aggregation tables
  2. do the final dump.

The aggregation tables should, of course, be indexed meaningfully. I think that should bring the final query down to a few seconds.
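As a sketch of what I mean for the first subquery (table and index names are my own invention; the ON DUPLICATE KEY UPDATE assumes rows are only ever inserted, never retroactively changed):

```sql
-- Pre-aggregated table replacing subquery A; the PK doubles as the
-- join index for the final report.
CREATE TABLE agg_delivery (
    day_est      DATE   NOT NULL,
    placement_id INT    NOT NULL,
    adunit_id    INT    NOT NULL,
    imp          BIGINT NOT NULL DEFAULT 0,
    clk          BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (day_est, placement_id, adunit_id)
);

-- Daily incremental refresh: fold in yesterday's raw rows.
INSERT INTO agg_delivery (day_est, placement_id, adunit_id, imp, clk)
SELECT day_est, placement_id, adunit_id, SUM(impressions), SUM(clicks)
FROM tbl_ads
WHERE source = 'delivery'
  AND day_est = CURDATE() - INTERVAL 1 DAY
GROUP BY day_est, placement_id, adunit_id
ON DUPLICATE KEY UPDATE
    imp = imp + VALUES(imp),
    clk = clk + VALUES(clk);
```

The final report then joins agg_delivery and its siblings on their primary keys instead of aggregating 400K rows three times.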

Thanks for all your advice. I ended up splitting the subqueries and creating temporary tables (with primary keys) for each, then joined the temp tables together at the end, and it now takes about 10 minutes to run.
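For anyone landing here later, the approach described above looks roughly like this (a sketch, not the poster's exact code; it assumes the grouping columns are never NULL, since primary key columns must be NOT NULL):

```sql
-- One temp table per subquery, keyed on the join columns so the
-- final join can use the index instead of scanning.
CREATE TEMPORARY TABLE tmp_a (
    PRIMARY KEY (day_est, placement_id, adunit_id)
)
SELECT day_est, placement_id, adunit_id,
       SUM(impressions) AS imp, SUM(clicks) AS clk
FROM tbl_ads
WHERE source = 'delivery'
GROUP BY day_est, placement_id, adunit_id;

-- ...build tmp_b and tmp_c the same way, then join the three
-- indexed temp tables in the final SELECT ... INTO OUTFILE.
```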
