
HIVE - what are the use cases for a bucket join

Asked 2013-06-19 16:04:38. Tags: join, hadoop, hive, buckets

I can't seem to find any good use case for a bucket join in Hive.
As I see it, when joining table A with table B:
A bucket join saves us the time of passing table A to the reducers: table B is loaded into the distributed cache, and each mapper processes one bucket of table A against the corresponding bucket of table B.

But the loading of table B into the distributed cache is done by a single task, so as the table gets bigger this becomes a bottleneck.
So if table B is small enough not to burden a single task, it's practically the same as doing a regular map join with a small optimization.

On the other hand, if table B can't fit into a single mapper as a whole, the process of reading it into the distributed cache could take a while.

Finally, it seems that the time to load table B into the distributed cache might be worth it, because we don't need to pass the buckets of table A from the mappers to the reducers. But that shuffle shouldn't be too heavy unless table A is really big: each mapper would read a single bucket that corresponds to a single reducer (the tables are bucketed by the join key), and each reducer fetches 2 intermediate outputs (one for each table, with a decent chance that the reducer is running on the same node as its corresponding mapper) and merges them. From this point the join is the same as in the mappers.

To conclude, I think the question is which costs more:

Loading a moderate-size table into the distributed cache by a single task
Passing a lot of moderate (maybe big) size buckets from the mappers to the reducers (mostly locally) and merging 2 files, all done in parallel.

What do you think? Can someone find a good use for a bucket join?

1 Answer

I think you're confusing a bucket join with a map join. In a map join, the smaller table is loaded into the distributed cache, assuming it's small enough, and it is sent to all the mappers. There's a 1 to N correspondence.
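To make the 1 to N shape concrete, here is a rough Python sketch of a map join. It is only a toy model, not Hive's implementation: the function names, the sample rows, and the idea of a "split" as a plain list are all assumptions made for illustration.

```python
from collections import defaultdict

def build_hash_table(small_rows):
    """Index the small table by join key; every mapper gets a copy of this."""
    table = defaultdict(list)
    for key, value in small_rows:
        table[key].append(value)
    return table

def map_side_join(big_split, small_table):
    """One mapper: probe the broadcast hash table for each row of its split."""
    return [(key, big_val, small_val)
            for key, big_val in big_split
            for small_val in small_table.get(key, [])]

# Hypothetical data: one small table, and a big table read as 2 mapper splits.
small = [(1, "x"), (2, "y")]
big_splits = [[(1, "a"), (3, "b")], [(2, "c"), (1, "d")]]

broadcast = build_hash_table(small)  # built once from the 1 small table
# The same hash table is handed to each of the N mappers.
joined = [row for split in big_splits
          for row in map_side_join(split, broadcast)]
print(joined)
```

The point to notice is that the whole small table has to fit in each mapper's memory, which is why this only works when one side is small.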

In a bucket join, you're joining two large tables, both of which store the data in the same way: in N buckets (files), bucketed and sorted by the same column you're joining on. So table A has N buckets and table B has N buckets too, so you can merge-sort bucket #1 of A with bucket #1 of B, #2 with #2, etc. It's a 1 to 1 correspondence, N times. This is also done on the map side, but the distributed cache is not involved.
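The 1 to 1 pairing can be sketched the same way in Python. Again this is a toy model under stated assumptions: `key % num_buckets` stands in for whatever bucketing hash Hive actually uses, and the helper names and sample rows are made up for the example.

```python
def bucketize(rows, num_buckets):
    """Split rows into num_buckets by key and sort each bucket by key.
    key % num_buckets is an assumed stand-in for the real bucketing hash."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return [sorted(b) for b in buckets]

def merge_join(left, right):
    """Sort-merge join of two buckets that are already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side and emit every pairing.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            out.extend((lk, lv, rv)
                       for _, lv in left[i:i_end]
                       for _, rv in right[j:j_end])
            i, j = i_end, j_end
    return out

N = 2  # both tables use the same bucket count and the same key
a_buckets = bucketize([(1, "a1"), (2, "a2"), (3, "a3"), (4, "a4")], N)
b_buckets = bucketize([(2, "b2"), (3, "b3"), (3, "b3x"), (5, "b5")], N)

# Bucket i of A is joined only with bucket i of B: N independent map-side tasks.
joined = [row for a, b in zip(a_buckets, b_buckets)
          for row in merge_join(a, b)]
print(joined)
```

Each pair of buckets is streamed once in key order, so neither table has to fit in memory and nothing is broadcast.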