Why is count(distinct) slower than group by in Hive?

On Hive, I believe count(distinct) is more likely than group by to spread work unevenly across reducers and end up with one sad reducer grinding away. Example query below.

Why?

Example query:

select count(distinct user)
from some_table

Version with group-by (proposed as faster):

select count(*) from
(select user
 from some_table
 group by user) q

Note: slide 26 of this presentation describes the problem.

select count(distinct user)
from some_table;

This query does its partial aggregation on the map side, and each mapper emits its result. Because the count is distinct, those partial results cannot simply be added up (a user seen by two mappers must still be counted once), so all of them have to be aggregated together to produce the total count, and that is the job of one single reducer.

select count(*) from
(select user
 from some_table
 group by user) q;

This query has two stages. In stage 1 the GROUP BY aggregates the users on the map side and emits one value for each user. That output then has to be aggregated on the reduce side, but it can use many reducers. In stage 2 the COUNT is performed on the map side, and the final result is then aggregated using one single reducer.

So if you have a very large number of map-side splits, the lone reducer of the first query has to aggregate the results from every one of them. The second query can use many reducers on the reduce side of stage 1 and then, in stage 2, leaves a much smaller task for the lone reducer at the end.
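To make the difference in reducer load concrete, here is a toy Python model of the two plans. It is only a sketch under stated assumptions, not Hive's actual execution: the split contents, NUM_SPLITS, NUM_REDUCERS and the crc32-based partitioner are all made up for illustration. It also assumes that for count(distinct) each mapper forwards its locally deduplicated users to the single reducer, and that stage 2 of the group-by plan hands the final reducer one partial count per stage-1 reducer.

```python
import zlib

NUM_SPLITS = 50     # hypothetical number of map-side splits
NUM_REDUCERS = 8    # hypothetical reducer count for stage 1 of the group-by plan


def make_split(s):
    """Toy split: 200 distinct users, each appearing twice (400 rows)."""
    return [f"user{(s * 37 + i) % 1000}" for i in range(200)] * 2


def partition(user, num_reducers):
    """Deterministic stand-in for a hash partitioner (an assumption)."""
    return zlib.crc32(user.encode("utf-8")) % num_reducers


def count_distinct_plan(splits):
    """Query 1 model: every mapper's deduplicated users go to ONE reducer."""
    rows_into_lone_reducer = 0
    seen = set()
    for split in splits:
        partial = set(split)                  # map-side partial aggregation
        rows_into_lone_reducer += len(partial)
        seen |= partial                       # lone reducer deduplicates across mappers
    return len(seen), rows_into_lone_reducer


def group_by_plan(splits, num_reducers):
    """Query 2 model: stage 1 spreads users over many reducers,
    stage 2 only sums one partial count per stage-1 reducer."""
    users_at = [set() for _ in range(num_reducers)]
    rows_into = [0] * num_reducers
    for split in splits:
        for user in set(split):               # map-side partial aggregation
            r = partition(user, num_reducers)
            users_at[r].add(user)
            rows_into[r] += 1
    partial_counts = [len(users) for users in users_at]
    # A given user always lands on the same reducer, so the partial counts add up.
    return sum(partial_counts), rows_into, len(partial_counts)


splits = [make_split(s) for s in range(NUM_SPLITS)]
distinct_1, lone_rows_1 = count_distinct_plan(splits)
distinct_2, stage1_rows, lone_rows_2 = group_by_plan(splits, NUM_REDUCERS)

print("count(distinct): result", distinct_1, "rows into lone reducer", lone_rows_1)
print("group by: result", distinct_2,
      "busiest stage-1 reducer", max(stage1_rows),
      "values into lone reducer", lone_rows_2)
```

Both plans return the same count. What changes is where the rows pile up: in the first model every map output row lands on one reducer, while in the second the same rows are shared among the stage-1 reducers and the final reducer only adds up a handful of numbers. How much that matters on a real cluster still depends on the number of splits and on the cost of the extra stage.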

This would normally not be an optimization. You would need a significant number of map splits for the query 1 reducer to become a problem. The second query has two stages, and that alone would make it slower than query 1 (stage 2 cannot start until stage 1 is completely done). So, while I can see some reasoning behind the advice you got, I would be skeptical unless proper measurement is done and shows an improvement.
