MySQL subquery much faster than join

I have the following two queries. Both return the same result and the same number of rows:

select * from (
               select UNIX_TIMESTAMP(network_time) * 1000 as epoch_network_datetime, 
                      hbrl.business_rule_id, 
                      display_advertiser_id, 
                      hbrl.campaign_id, 
                      truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network, 
                      sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view, 
                      sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click, 
                      truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent, 
                      truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm,
                      truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc 
               from hourly_business_rule_level hbrl
               where (publisher_network_id = 31534) 
               and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f') and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f') 
               and (network_time IS NOT NULL and display_advertiser_id > 0)
               group by network_time, hbrl.campaign_id, hbrl.business_rule_id
               having demand_ad_spend_network > 0
               OR demand_ad_view > 0
               OR demand_ad_click > 0
               OR ctr_percent > 0
               OR ecpm > 0
               OR ecpc > 0
               order by epoch_network_datetime) as atb
       left join dim_demand demand on atb.display_advertiser_id = demand.advertiser_dsp_id 
       and atb.campaign_id = demand.campaign_id 
       and atb.business_rule_id = demand.business_rule_id 

Running EXPLAIN EXTENDED on it gives these results:

+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
| id | select_type | table                      | type | possible_keys                                                                 | key     | key_len | ref             | rows    | filtered | Extra                                        |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>                 | ALL  | NULL                                                                          | NULL    | NULL    | NULL            | 1451739 |   100.00 | NULL                                         |
|  1 | PRIMARY     | demand                     | ref  | PRIMARY,join_index                                                            | PRIMARY | 4       | atb.campaign_id |       1 |   100.00 | Using where                                  |
|  2 | DERIVED     | hourly_business_rule_level | ALL  | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL    | NULL    | NULL            | 1494447 |    97.14 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+

The other one is:

select UNIX_TIMESTAMP(network_time) * 1000 as epoch_network_datetime, 
       hbrl.business_rule_id, 
       display_advertiser_id, 
       hbrl.campaign_id, 
       truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network, 
       sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view, 
       sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click, 
       truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent, 
       truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm, 
       truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc 
from hourly_business_rule_level hbrl
join dim_demand demand on hbrl.display_advertiser_id = demand.advertiser_dsp_id 
and hbrl.campaign_id = demand.campaign_id 
and hbrl.business_rule_id = demand.business_rule_id 
where (publisher_network_id = 31534) 
and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f') and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f') 
and (network_time IS NOT NULL and display_advertiser_id > 0)
group by network_time, hbrl.campaign_id, hbrl.business_rule_id
having demand_ad_spend_network > 0
OR demand_ad_view > 0
OR demand_ad_click > 0 
OR ctr_percent > 0
OR ecpm > 0
OR ecpc > 0
order by epoch_network_datetime;

These are the EXPLAIN results for the second query:

+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
| id | select_type | table                      | type | possible_keys                                                                 | key     | key_len | ref                                                           | rows    | filtered | Extra                                        |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
|  1 | SIMPLE      | hourly_business_rule_level | ALL  | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL    | NULL    | NULL                                                          | 1494447 |    97.14 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | demand                     | ref  | PRIMARY,join_index                                                            | PRIMARY | 4       | my6sense_datawarehouse.hourly_business_rule_level.campaign_id |       1 |   100.00 | Using where; Using index                     |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+

The first one takes about 2 seconds, while the second one takes 2 minutes!

Why does the second query take so long? What am I missing here?

Thanks.

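One possible reason is the number of rows that have to be joined with the second table.

The GROUP BY clause and the HAVING clause limit the number of rows returned from the subquery. Only those rows are used for the join.

Without the subquery, only the WHERE clause limits the number of rows for the JOIN. The JOIN is done before the GROUP BY and HAVING clauses are processed. Depending on the group size and the selectivity of the HAVING conditions, many more rows may have to be joined.

Consider the following simplified example:

We have a table users with 1000 entries and the columns id and email.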

create table users(
    id smallint auto_increment primary key,
    email varchar(50) unique
);

Then we have a (huge) log table user_actions with 1,000,000 entries and the columns id, user_id, timestamp and action.

create table user_actions(
    id mediumint auto_increment primary key,
    user_id smallint not null,
    timestamp timestamp,
    action varchar(50),
    index (timestamp, user_id)
);

The task is to find all users who have at least 900 entries in the log table since 2017-02-01.

The subquery solution:

select a.user_id, a.cnt, u.email
from (
    select a.user_id, count(*) as cnt
    from user_actions a
    where a.timestamp >= '2017-02-01 00:00:00'
    group by a.user_id
    having cnt >= 900
) a
left join users u on u.id = a.user_id

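The subquery returns 135 rows (users). Only those rows are joined with the users table. The subquery runs in about 0.375 seconds. The time needed for the join is almost zero, so the full query also runs in about 0.375 seconds.

The solution without a subquery: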

select a.user_id, count(*) as cnt, u.email
from user_actions a
left join users u on u.id = a.user_id
where a.timestamp >= '2017-02-01 00:00:00'
group by a.user_id
having cnt >= 900

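The WHERE condition filters the table down to 866,081 rows. The JOIN has to be done for all of those 866K rows. Only after the JOIN are the GROUP BY and HAVING clauses processed, which limit the result to 135 rows. This query needs about 0.815 seconds.

So you can already see that a subquery can improve performance.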
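The effect can be reproduced at a small scale. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module instead of MySQL/MariaDB, a made-up deterministic data distribution, and a column named ts in place of timestamp. It does not measure time; it only counts how many rows each form of the query feeds into the JOIN and checks that both forms return the same result.

```python
import sqlite3


def build_db(n_users=50, n_actions=5000):
    """Create a small in-memory stand-in for users / user_actions (hypothetical data)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table users(id integer primary key, email text unique);
        create table user_actions(
            id integer primary key,
            user_id integer not null,
            ts integer,          -- stand-in for the timestamp column
            action text
        );
    """)
    con.executemany(
        "insert into users values (?, ?)",
        [(i, f"user{i}@example.com") for i in range(1, n_users + 1)],
    )
    # Assumed skew: user 1 receives every second action, the rest are spread out.
    rows = []
    for seq in range(1, n_actions + 1):
        user_id = 1 if seq % 2 == 0 else (seq % n_users) + 1
        rows.append((seq, user_id, seq, "x"))
    con.executemany("insert into user_actions values (?, ?, ?, ?)", rows)
    return con


def join_input_rows(con, since, min_count):
    """Return (rows joined without subquery, rows joined with subquery)."""
    flat = con.execute(
        "select count(*) from user_actions where ts >= ?", (since,)
    ).fetchone()[0]
    sub = con.execute(
        """select count(*) from (
               select user_id from user_actions
               where ts >= ? group by user_id having count(*) >= ?
           )""",
        (since, min_count),
    ).fetchone()[0]
    return flat, sub


def run_both(con, since, min_count):
    """Run the subquery form and the flat form; both should agree."""
    with_subquery = con.execute(
        """select a.user_id, a.cnt, u.email
           from (
               select user_id, count(*) as cnt
               from user_actions
               where ts >= ?
               group by user_id
               having count(*) >= ?
           ) a
           left join users u on u.id = a.user_id
           order by a.user_id""",
        (since, min_count),
    ).fetchall()
    flat = con.execute(
        """select a.user_id, count(*) as cnt, u.email
           from user_actions a
           left join users u on u.id = a.user_id
           where a.ts >= ?
           group by a.user_id
           having count(*) >= ?
           order by a.user_id""",
        (since, min_count),
    ).fetchall()
    return with_subquery, flat
```

With the default data, since=1001 and min_count=100, the flat form has to join 4000 rows while the subquery form joins a single row, and both return the same one-row result. How much that difference costs in practice depends on the engine and the available indexes.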

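But let's make things worse and drop the primary key in the users table. That way there is no index that can be used for the JOIN. Now the first query runs in 0.455 seconds. The second query needs 40 seconds, almost 100 times slower.

Notes

It is hard to say whether the same applies to your case. The reasons are:

  • Your queries are quite complex and far from being an MCVE.
  • I don't see anything being selected from the demand table, so it is unclear why you join it at all.
  • You use LEFT JOIN in one query and INNER JOIN in the other.
  • The relation between the two tables is unclear.
  • There is no information about indexes. You should provide the CREATE statements (SHOW CREATE TABLE table_name).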
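The LEFT JOIN versus INNER JOIN point affects results as well as speed. As a hedged illustration (Python's sqlite3 module, with hypothetical tables facts and dim that are not from the question), the two join types return the same rows only when every row on the left side has a match:

```python
import sqlite3


def join_results():
    """Return (inner join rows, left join rows) for a tiny hypothetical data set."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table facts(campaign_id integer, clicks integer);
        create table dim(campaign_id integer primary key, name text);
        insert into facts values (1, 10), (2, 20), (3, 30);
        insert into dim values (1, 'a'), (2, 'b');  -- campaign 3 has no dim row
    """)
    inner = con.execute(
        """select f.campaign_id, d.name
           from facts f
           join dim d on d.campaign_id = f.campaign_id
           order by f.campaign_id"""
    ).fetchall()
    left = con.execute(
        """select f.campaign_id, d.name
           from facts f
           left join dim d on d.campaign_id = f.campaign_id
           order by f.campaign_id"""
    ).fetchall()
    con.close()
    return inner, left
```

Here the inner join drops campaign 3, while the left join keeps it with a NULL name. If the two queries in the question really return identical rows, every group presumably has a matching dim_demand row.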

Test setup

drop table if exists users;
create table users(
    id smallint auto_increment primary key,
    email varchar(50) unique
)
    select seq as id, rand(1) as email
    from seq_1_to_1000
;


drop table if exists user_actions;
create table user_actions(
    id mediumint auto_increment primary key,
    user_id smallint not null,
    timestamp timestamp,
    action varchar(50),
    index (timestamp, user_id)
)
    select seq as id
        , floor(rand(2)*1000)+1 as user_id
        #, '2017-01-01 00:00:00' + interval seq*20 second as timestamp
        , from_unixtime(unix_timestamp('2017-01-01 00:00:00') + seq*20) as timestamp
        , rand(3) as action
    from seq_1_to_1000000
;

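MariaDB 10.0.19 with the sequence plugin.

The queries are different. One says JOIN, the other says LEFT JOIN. You are not using demand, so the join is probably useless. However, with JOIN you are filtering out advertisers that are not in dim_demand; is that the intent?

But that does not address your question.

The EXPLAINs estimate about 1.5M rows for hbrl. But how many rows show up in the results? I would guess far fewer. From this, I can answer your question.

Consider these two: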

SELECT ... FROM ( SELECT ... FROM a
                      GROUP BY or HAVING or LIMIT ) x
           JOIN b

SELECT ... FROM a
           JOIN b
           GROUP BY or HAVING or LIMIT

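The first one reduces the number of rows that need to be joined to b; the second one has to do the full 1.5M joins. I suspect that the time taken to do the JOIN (whether LEFT or not) is where the difference lies.

Plan A: Remove demand from the query.

Plan B: Use a subquery whenever it significantly shrinks the number of rows before the JOIN.

Indexing (may speed up both variants):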

INDEX(publisher_network_id, network_time)

And get rid of this, since it is useless (the BETWEEN will fail for NULL anyway):

and network_time IS NOT NULL
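Why the extra check is redundant can be seen from standard SQL three-valued logic: NULL BETWEEN x AND y evaluates to NULL, and WHERE discards rows whose condition is not true. A minimal sketch, using Python's sqlite3 module and a hypothetical table t rather than the real schema:

```python
import sqlite3


def between_with_null():
    """Compare a BETWEEN filter with and without an explicit IS NOT NULL guard."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table t(id integer primary key, ts integer);
        insert into t(ts) values (5), (NULL), (50);
    """)
    plain = con.execute(
        "select id from t where ts between 1 and 10 order by id"
    ).fetchall()
    guarded = con.execute(
        "select id from t where (ts between 1 and 10) and ts is not null order by id"
    ).fetchall()
    # The comparison itself yields NULL (None in Python), not false.
    null_test = con.execute("select NULL between 1 and 10").fetchone()[0]
    con.close()
    return plain, guarded, null_test
```

Both filters return the same single row, so the guard adds nothing to the result.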

Side note: I recommend simplifying and fixing the first condition below by replacing it with the second one:

and  network_time
   between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f')
       AND str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')

and network_time >= '2017-08-13 17:00:00'
and network_time  < '2017-08-13 17:00:00' + INTERVAL 24 HOUR
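One reason the half-open form counts as a fix: an upper bound of 16:59:59.999 leaves a small gap before 17:00:00. Whether that gap matters depends on the precision of network_time, which the question does not show. The sketch below assumes microsecond precision, stores the values as fixed-width text in SQLite (an assumption made only to keep the example self-contained), and uses a hypothetical table t:

```python
import sqlite3
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"  # fixed-width, so text comparison matches time order


def compare_ranges():
    """Return (rows matched by the closed BETWEEN range, rows matched by the half-open range)."""
    start = datetime(2017, 8, 13, 17, 0, 0)
    end = start + timedelta(hours=24)  # mirrors "+ INTERVAL 24 HOUR"
    samples = [
        start,                                      # first instant of the window
        datetime(2017, 8, 14, 16, 59, 59, 999500),  # inside the last millisecond
        end,                                        # first instant of the next window
    ]
    con = sqlite3.connect(":memory:")
    con.execute("create table t(network_time text)")
    con.executemany(
        "insert into t values (?)", [(d.strftime(FMT),) for d in samples]
    )
    closed = con.execute(
        "select count(*) from t where network_time between ? and ?",
        (start.strftime(FMT), "2017-08-14 16:59:59.999000"),
    ).fetchone()[0]
    half_open = con.execute(
        "select count(*) from t where network_time >= ? and network_time < ?",
        (start.strftime(FMT), end.strftime(FMT)),
    ).fetchone()[0]
    con.close()
    return closed, half_open
```

The closed range misses the row that falls in the last half millisecond, while the half-open range includes it and still excludes the first instant of the next window.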

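Use a subquery whenever it significantly shrinks the number of rows before any JOIN. This reinforces Rick James's Plan B, and Rick's and Paul's answers, which you have already documented. The answers by Rick and Paul deserve to be accepted.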


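Notice: the technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.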
