[英]Mysql subquery much faster than join
我有以下查詢,它們都返回相同的結果和行數:
select * from (
select UNIX_TIMESTAMP(network_time) * 1000 as epoch_network_datetime,
hbrl.business_rule_id,
display_advertiser_id,
hbrl.campaign_id,
truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network,
sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view,
sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm,
truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc
from hourly_business_rule_level hbrl
where (publisher_network_id = 31534)
and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f') and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')
and (network_time IS NOT NULL and display_advertiser_id > 0)
group by network_time, hbrl.campaign_id, hbrl.business_rule_id
having demand_ad_spend_network > 0
OR demand_ad_view > 0
OR demand_ad_click > 0
OR ctr_percent > 0
OR ecpm > 0
OR ecpc > 0
order by epoch_network_datetime) as atb
left join dim_demand demand on atb.display_advertiser_id = demand.advertiser_dsp_id
and atb.campaign_id = demand.campaign_id
and atb.business_rule_id = demand.business_rule_id
運行解釋擴展,這些是結果:
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1451739 | 100.00 | NULL |
| 1 | PRIMARY | demand | ref | PRIMARY,join_index | PRIMARY | 4 | atb.campaign_id | 1 | 100.00 | Using where |
| 2 | DERIVED | hourly_business_rule_level | ALL | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL | NULL | NULL | 1494447 | 97.14 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
另一個是:
select UNIX_TIMESTAMP(network_time) * 1000 as epoch_network_datetime,
hbrl.business_rule_id,
display_advertiser_id,
hbrl.campaign_id,
truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network,
sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view,
sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm,
truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc
from hourly_business_rule_level hbrl
join dim_demand demand on hbrl.display_advertiser_id = demand.advertiser_dsp_id
and hbrl.campaign_id = demand.campaign_id
and hbrl.business_rule_id = demand.business_rule_id
where (publisher_network_id = 31534)
and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f') and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')
and (network_time IS NOT NULL and display_advertiser_id > 0)
group by network_time, hbrl.campaign_id, hbrl.business_rule_id
having demand_ad_spend_network > 0
OR demand_ad_view > 0
OR demand_ad_click > 0
OR ctr_percent > 0
OR ecpm > 0
OR ecpc > 0
order by epoch_network_datetime;
這些是第二個查詢的結果:
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
| 1 | SIMPLE | hourly_business_rule_level | ALL | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL | NULL | NULL | 1494447 | 97.14 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | demand | ref | PRIMARY,join_index | PRIMARY | 4 | my6sense_datawarehouse.hourly_business_rule_level.campaign_id | 1 | 100.00 | Using where; Using index |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
第一個需要大約2秒鍾,而第二個需要2分鍾!
為什么第二個查詢需要這么長時間? 我在這里想念的是什么?
謝謝。
一個可能的原因是必須與第二個表連接的行數。
GROUP BY子句和HAVING子句將限制從子查詢返回的行數。 只有那些行將用於連接。
如果沒有子查詢,則只有WHERE子句限制JOIN的行數。 JOIN在處理GROUP BY和HAVING子句之前完成。 根據組大小和HAVING條件的選擇性,需要連接的行數要多得多。
請考慮以下簡化示例:
我們有一個表users
有1000個條目和列id
, email
。
create table users(
id smallint auto_increment primary key,
email varchar(50) unique
);
然后我們有一個(巨大的)日志表user_actions
其中包含1,000,000個條目以及列id
, user_id
, timestamp
, action
create table user_actions(
id mediumint auto_increment primary key,
user_id smallint not null,
timestamp timestamp,
action varchar(50),
index (timestamp, user_id)
);
任務是查找自2017-02-01以來日志表中至少有900個條目的所有用戶。
select a.user_id, a.cnt, u.email
from (
select a.user_id, count(*) as cnt
from user_actions a
where a.timestamp >= '2017-02-01 00:00:00'
group by a.user_id
having cnt >= 900
) a
left join users u on u.id = a.user_id
子查詢返回135行(用戶)。 只有那些行將與users
表連接。 子查詢運行大約0.375秒。 連接所需的時間幾乎為零,因此完整查詢的運行時間約為0.375秒。
select a.user_id, count(*) as cnt, u.email
from user_actions a
left join users u on u.id = a.user_id
where a.timestamp >= '2017-02-01 00:00:00'
group by a.user_id
having cnt >= 900
WHERE條件將表過濾為866,081行。 必須為所有這些866K行完成JOIN。 在JOIN之后處理GROUP BY和HAVING子句並將結果限制為135行。 此查詢大約需要0.815秒。
所以你已經可以看到,子查詢可以提高性能。
但是讓我們把事情變得更糟,並將主鍵放在users
表中。 這樣我們就沒有可用於JOIN的索引。 現在第一個查詢在0.455秒內運行。 第二個查詢需要40秒 - 幾乎慢100倍 。
如果同樣適用於您的情況,則很難說。 原因是:
demand
表中選擇任何東西 - 所以目前還不清楚你為什么要加入它。 SHOW CREATE table_name
)。 drop table if exists users;
create table users(
id smallint auto_increment primary key,
email varchar(50) unique
)
select seq as id, rand(1) as email
from seq_1_to_1000
;
drop table if exists user_actions;
create table user_actions(
id mediumint auto_increment primary key,
user_id smallint not null,
timestamp timestamp,
action varchar(50),
index (timestamp, user_id)
)
select seq as id
, floor(rand(2)*1000)+1 as user_id
#, '2017-01-01 00:00:00' + interval seq*20 second as timestamp
, from_unixtime(unix_timestamp('2017-01-01 00:00:00') + seq*20) as timestamp
, rand(3) as action
from seq_1_to_1000000
;
帶序列插件的MariaDB 10.0.19。
查詢是不同的。 一個說JOIN
,另一個說LEFT JOIN
。 您沒有使用demand
,因此連接可能沒用。 但是,對於JOIN
,您要過濾掉不在dim_demand
廣告客戶; 這意味着什么?
但這並沒有解決這個問題。
該EXPLAINs
,有150萬行的估計hbrl
。 但是結果中出現了多少? 我猜它會少得多。 由此,我可以回答你的問題。
考慮這兩個:
SELECT ... FROM ( SELECT ... FROM a
GROUP BY or HAVING or LIMIT ) x
JOIN b
SELECT ... FROM a
JOIN b
GROUP BY or HAVING or LIMIT
第一個會減少需要加入b
的行數; 第二個需要做一個完整的1.5M連接。 我懷疑進行JOIN
所需的時間( LEFT
或不LEFT
)是差異所在。
計划A:從查詢中刪除demand
。
計划B:只要子查詢顯着縮小JOIN
之前的行數,就使用子查詢。
索引(可以加快兩種變體):
INDEX(publisher_network_id, network_time)
並且擺脫這個是無用的(因為對於NULL
, between
無論如何都會失敗):
and network_time IS NOT NULL
旁注:我建議簡化並修復此問題
and network_time
between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f')
AND str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')
至
and network_time >= '2017-08-13 17:00:00
and network_time < '2017-08-13 17:00:00 + INTERVAL 24 HOUR
每當子查詢顯着縮小行數之前使用子查詢 - 任何加入 - 總是強化Rick James Plan B.加強Rick&Paul的答案,你已經記錄了。 里克和保羅的答案值得接受。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.