MySQL subquery much faster than join

I have the following queries, which both return the same result and row count:

select * from (
               select UNIX_TIMESTAMP(network_time) * 1000 as epoch_network_datetime, 
                      hbrl.business_rule_id, 
                      display_advertiser_id, 
                      hbrl.campaign_id, 
                      truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network, 
                      sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view, 
                      sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click, 
                      truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent, 
                      truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm,
                      truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc 
               from hourly_business_rule_level hbrl
               where (publisher_network_id = 31534) 
               and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f') and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f') 
               and (network_time IS NOT NULL and display_advertiser_id > 0)
               group by network_time, hbrl.campaign_id, hbrl.business_rule_id
               having demand_ad_spend_network > 0
               OR demand_ad_view > 0
               OR demand_ad_click > 0
               OR ctr_percent > 0
               OR ecpm > 0
               OR ecpc > 0
               order by epoch_network_datetime) as atb
       left join dim_demand demand on atb.display_advertiser_id = demand.advertiser_dsp_id 
       and atb.campaign_id = demand.campaign_id 
       and atb.business_rule_id = demand.business_rule_id 

I ran EXPLAIN EXTENDED, and these are the results:

+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
| id | select_type | table                      | type | possible_keys                                                                 | key     | key_len | ref             | rows    | filtered | Extra                                        |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>                 | ALL  | NULL                                                                          | NULL    | NULL    | NULL            | 1451739 |   100.00 | NULL                                         |
|  1 | PRIMARY     | demand                     | ref  | PRIMARY,join_index                                                            | PRIMARY | 4       | atb.campaign_id |       1 |   100.00 | Using where                                  |
|  2 | DERIVED     | hourly_business_rule_level | ALL  | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL    | NULL    | NULL            | 1494447 |    97.14 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+

and the other is:

select UNIX_TIMESTAMP(network_time) * 1000 as epoch_network_datetime, 
       hbrl.business_rule_id, 
       display_advertiser_id, 
       hbrl.campaign_id, 
       truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network, 
       sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view, 
       sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click, 
       truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent, 
       truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm, 
       truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc 
from hourly_business_rule_level hbrl
join dim_demand demand on hbrl.display_advertiser_id = demand.advertiser_dsp_id 
and hbrl.campaign_id = demand.campaign_id 
and hbrl.business_rule_id = demand.business_rule_id 
where (publisher_network_id = 31534) 
and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f') and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f') 
and (network_time IS NOT NULL and display_advertiser_id > 0)
group by network_time, hbrl.campaign_id, hbrl.business_rule_id
having demand_ad_spend_network > 0
OR demand_ad_view > 0
OR demand_ad_click > 0 
OR ctr_percent > 0
OR ecpm > 0
OR ecpc > 0
order by epoch_network_datetime;

and these are the results for the second query:

+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
| id | select_type | table                      | type | possible_keys                                                                 | key     | key_len | ref                                                           | rows    | filtered | Extra                                        |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
|  1 | SIMPLE      | hourly_business_rule_level | ALL  | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL    | NULL    | NULL                                                          | 1494447 |    97.14 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | demand                     | ref  | PRIMARY,join_index                                                            | PRIMARY | 4       | my6sense_datawarehouse.hourly_business_rule_level.campaign_id |       1 |   100.00 | Using where; Using index                     |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+

The first one takes about 2 seconds, while the second one takes over 2 minutes!

Why is the second query taking so long? What am I missing here?

Thanks.

One possible reason is the number of rows that have to be joined with the second table.

The GROUP BY clause and the HAVING clause will limit the number of rows returned from your subquery. Only those rows will be used for the join.

Without the subquery, only the WHERE clause limits the number of rows for the JOIN. The JOIN is done before the GROUP BY and HAVING clauses are processed. Depending on group size and the selectivity of the HAVING conditions, there can be many more rows that need to be joined.
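As a reminder, the logical evaluation order of a SELECT makes this clear - joins happen first, aggregation and HAVING afterwards:

    FROM (including JOINs) -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY

So in the join-only variant, every row that survives the WHERE clause is joined before a single group is formed.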

Consider the following simplified example:

We have a table users with 1000 entries and the columns id and email:

create table users(
    id smallint auto_increment primary key,
    email varchar(50) unique
);

Then we have a (huge) log table user_actions with 1,000,000 entries and the columns id, user_id, timestamp, and action:

create table user_actions(
    id mediumint auto_increment primary key,
    user_id smallint not null,
    timestamp timestamp,
    action varchar(50),
    index (timestamp, user_id)
);

The task is to find all users who have at least 900 entries in the log table since 2017-02-01.

The subquery solution:

select a.user_id, a.cnt, u.email
from (
    select a.user_id, count(*) as cnt
    from user_actions a
    where a.timestamp >= '2017-02-01 00:00:00'
    group by a.user_id
    having cnt >= 900
) a
left join users u on u.id = a.user_id

The subquery returns 135 rows (users). Only those rows will be joined with the users table. The subquery runs in about 0.375 seconds. The time needed for the join is almost zero, so the full query runs in about 0.375 seconds.

Solution without subquery:

select a.user_id, count(*) as cnt, u.email
from user_actions a
left join users u on u.id = a.user_id
where a.timestamp >= '2017-02-01 00:00:00'
group by a.user_id
having cnt >= 900

The WHERE condition filters the table to 866,081 rows. The JOIN has to be done for all those 866K rows. After the JOIN, the GROUP BY and HAVING clauses are processed and limit the result to 135 rows. This query needs about 0.815 seconds.

So you can already see that a subquery can improve the performance.

But let's make things worse and drop the primary key on the users table. This way we have no index which can be used for the JOIN. Now the first query runs in 0.455 seconds. The second query needs 40 seconds - almost 100 times slower.
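For reference, a minimal sketch of that change (the auto_increment attribute must be removed first, because MySQL requires an auto-increment column to be part of a key):

    alter table users
        modify id smallint not null,  -- strip auto_increment so the primary key can be dropped
        drop primary key;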

Notes

It's difficult to say whether the same applies to your case. Reasons are:

  • Your queries are quite complex and far from being an MVCE.
  • I don't see anything being selected from the demand table - so it's unclear why you are joining it at all.
  • You use a LEFT JOIN in one query and an INNER JOIN in the other.
  • The relation between the two tables is unclear.
  • No information about indexes. You should provide the CREATE statements (SHOW CREATE TABLE table_name), as shown after this list.
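For example (table names taken from the question):

    show create table hourly_business_rule_level;
    show create table dim_demand;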

Test setup

drop table if exists users;
create table users(
    id smallint auto_increment primary key,
    email varchar(50) unique
)
    select seq as id, rand(1) as email
    from seq_1_to_1000
;


drop table if exists user_actions;
create table user_actions(
    id mediumint auto_increment primary key,
    user_id smallint not null,
    timestamp timestamp,
    action varchar(50),
    index (timestamp, user_id)
)
    select seq as id
        , floor(rand(2)*1000)+1 as user_id
        #, '2017-01-01 00:00:00' + interval seq*20 second as timestamp
        , from_unixtime(unix_timestamp('2017-01-01 00:00:00') + seq*20) as timestamp
        , rand(3) as action
    from seq_1_to_1000000
;

MariaDB 10.0.19 with the sequence plugin.
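The seq_1_to_N tables used above are virtual tables from MariaDB's SEQUENCE storage engine; they generate integer sequences on the fly and occupy no storage. A quick sanity check:

    select seq from seq_1_to_5;  -- returns five rows: 1, 2, 3, 4, 5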

The queries are different. One says JOIN, the other says LEFT JOIN. You are not using demand, so the join is probably useless. However, in the case of JOIN, you are filtering out advertisers that are not in dim_demand; is that the intent?

But that does not address the question. 但这并没有解决这个问题。

The EXPLAINs estimate that there are 1.5M rows in hbrl. But how many show up in the result? I would guess it is a lot fewer. From this, I can answer your question.

Consider these two:

SELECT ... FROM ( SELECT ... FROM a
                      GROUP BY or HAVING or LIMIT ) x
           JOIN b

SELECT ... FROM a
           JOIN b
           GROUP BY or HAVING or LIMIT

The first will decrease the number of rows that need to be joined to b; the second will need to do the full 1.5M joins. I suspect that the time taken to do the JOIN (be it LEFT or not) is where the difference is.

Plan A: Remove demand from the query.

Plan B: Use a subquery whenever the subquery significantly shrinks the number of rows before the JOIN.

Indexing (may speed up both variants):

INDEX(publisher_network_id, network_time)
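As a concrete statement (the index name idx_pub_time is only illustrative):

    ALTER TABLE hourly_business_rule_level
        ADD INDEX idx_pub_time (publisher_network_id, network_time);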

and get rid of this as being useless (since the BETWEEN will fail anyway for NULL):

and network_time IS NOT NULL
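A quick demonstration that the test is redundant - a NULL operand makes BETWEEN evaluate to NULL, which WHERE treats like false:

    SELECT NULL BETWEEN '2017-08-13' AND '2017-08-14';  -- returns NULL, not 1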

Side note: I recommend simplifying and fixing this

and  network_time
   between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f')
       AND str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')

to

and network_time >= '2017-08-13 17:00:00'
and network_time  < '2017-08-13 17:00:00' + INTERVAL 24 HOUR

This half-open range covers the entire 24 hours (the original BETWEEN silently misses anything after 16:59:59.999), and it still allows an index range scan.

Use a subquery whenever the subquery significantly shrinks the number of rows before any JOIN - this reinforces Rick James's Plan B, and Paul's answer, which you have already documented. The answers by Rick and Paul deserve acceptance.
