When exactly is the combiner called in MapReduce?

Original post: 2015-07-07 05:45:19 | Tags: hadoop, mapreduce, combiners

Combiners are written using the same class as the reducer, with mostly the same code. But when exactly is the combiner called: before sort and shuffle, or just before reduce? If it runs before sort and shuffle, i.e. just after the mapper, how does it get its input as [key, list<values>], since that grouping is produced by sort and shuffle? And if it runs after sort and shuffle, i.e. just before the reducer, then the combiner's output is [key, value] like a reducer's, so how does the reducer get its input as [key, list<values>]?

4 answers

A combiner is like a pre-reducer: it is applied right after the map phase, before the sort and shuffle phase.

It is applied on the same host where the map phase runs, which minimises the data transferred across the network for the next phases of processing (sort and shuffle, then reduce).

Because of this optimization, the actual reduce phase has less data to process, which results in better performance.
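To make the saving concrete, here is a minimal sketch in plain Python (not Hadoop code) that simulates a word count with and without a combiner. The function names, the `splits` data, and the idea of counting "records shuffled" are all assumptions made for illustration; each inner list stands in for one map task running on its own host.

```python
from collections import defaultdict

def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def sum_counts(key, values):
    """Used as both combiner and reducer: emit (key, total)."""
    return [(key, sum(values))]

def group_by_key(pairs):
    """Group (key, value) pairs into {key: [values]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def run_job(splits, use_combiner):
    """Simulate a job; return (final counts, number of records shuffled)."""
    shuffled = []
    for split in splits:  # each split stands in for one map task on one host
        map_output = [pair for line in split for pair in map_words(line)]
        if use_combiner:
            # Local pre-aggregation on the map host, before anything is shipped.
            local = group_by_key(map_output)
            map_output = [out for k in local for out in sum_counts(k, local[k])]
        shuffled.extend(map_output)
    grouped = group_by_key(shuffled)  # stands in for sort and shuffle
    result = dict(out for k in grouped for out in sum_counts(k, grouped[k]))
    return result, len(shuffled)

splits = [["a b a", "a c"], ["b b", "c a"]]
plain, plain_records = run_job(splits, use_combiner=False)
combined, combined_records = run_job(splits, use_combiner=True)
```

In this toy run both jobs produce the same counts, but the combined version ships 6 records instead of 9 to the reduce side. The reducer still has to merge the partial counts coming from different map tasks.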

It actually runs after the map phase and before sort and shuffle. After the map phase, the output is pipelined to the sort and shuffle phase, and the combiner acts before that phase. The flow is: Map -> Combiner -> Sort and Shuffle -> Reducer

The output types of a combiner must match the output types of the mapper. Hadoop makes no guarantees about how many times the combiner is applied, or that it is applied at all.

If your mapper extends Mapper<K1, V1, K2, V2> and your reducer extends Reducer<K2, V2, K3, V3>, then the combiner must be an extension of Reducer<K2, V2, K2, V2>.
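One way to see why the combiner has to consume and emit the same (K2, V2) types is an average, where the reducer cannot simply be reused as the combiner. The sketch below is plain Python rather than Hadoop code, and the `mapper`, `combiner`, `reducer`, `group`, and `run` names as well as the sample records are hypothetical. Here V2 is a (sum, count) pair and V3 is the final average.

```python
from collections import defaultdict

def mapper(record):
    """Map (K1, V1) -> (K2, V2): V2 is a (sum, count) pair."""
    city, temperature = record
    return (city, (temperature, 1))

def combiner(key, values):
    """Combine (K2, [V2]) -> (K2, V2): same types in and out."""
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return (key, (total, count))

def reducer(key, values):
    """Reduce (K2, [V2]) -> (K3, V3): V3 is the final average."""
    _, (total, count) = combiner(key, values)
    return (key, total / count)

def group(pairs):
    """Group (key, value) pairs into {key: [values]}."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def run(records, combiner_passes):
    """Run the toy job, applying the combiner the given number of times."""
    pairs = [mapper(r) for r in records]
    for _ in range(combiner_passes):
        pairs = [combiner(k, vs) for k, vs in group(pairs).items()]
    return dict(reducer(k, vs) for k, vs in group(pairs).items())

records = [("oslo", 10.0), ("oslo", 20.0), ("oslo", 30.0), ("rome", 25.0)]
results = [run(records, passes) for passes in (0, 1, 2)]
```

Because the combiner's output has the same shape as the mapper's output, the result is the same whether the combiner runs zero, one, or two times, which matters given that Hadoop does not promise how often it runs.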

The combiner is applied on the same machine as the map operation. Definitely before the shuffle.

As stated in the Hadoop documentation:

When the map operation outputs its pairs they are already available in memory. For efficiency reasons, sometimes it makes sense to take advantage of this fact by supplying a combiner class to perform a reduce-type function. If a combiner is used then the map key-value pairs are not immediately written to the output. Instead they will be collected in lists, one list per each key value. When a certain number of key-value pairs have been written, this buffer is flushed by passing all the values of each key to the combiner's reduce method and outputting the key-value pairs of the combine operation as if they were created by the original map operation.

http://wiki.apache.org/hadoop/HadoopMapReduce
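The buffering behaviour described in that quote can be sketched roughly as follows. This is a toy model in plain Python, not Hadoop's actual implementation: the `CombiningBuffer` class, its `flush_threshold` parameter, and the sample input are all made up for illustration.

```python
from collections import defaultdict

class CombiningBuffer:
    """Toy map-side buffer that runs a combiner on each flush."""

    def __init__(self, combine, flush_threshold):
        self.combine = combine              # function (key, values) -> value
        self.flush_threshold = flush_threshold
        self.lists = defaultdict(list)      # one list of values per key
        self.buffered = 0                   # pairs collected since last flush
        self.output = []                    # what downstream actually sees

    def write(self, key, value):
        """Called by the mapper instead of writing straight to the output."""
        self.lists[key].append(value)
        self.buffered += 1
        if self.buffered >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Pass each key's values to the combiner and emit the results."""
        for key in sorted(self.lists):
            self.output.append((key, self.combine(key, self.lists[key])))
        self.lists.clear()
        self.buffered = 0

buffer = CombiningBuffer(combine=lambda key, values: sum(values), flush_threshold=4)
for word in "a b a a b c".split():
    buffer.write(word, 1)
buffer.flush()  # flush whatever is left when the map task ends
```

In this run the key "b" shows up twice in `buffer.output`, once per flush. That is why the combiner's output is still plain [key, value] pairs: the shuffle groups them again, so the reducer receives [key, list<values>] as usual.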

The MapReduce framework will not always call the combiner, even if you write a custom one. It will definitely call the combiner if the number of spills is at least 3 (the default). The number of spills required for the combiner to run can be configured through the min.num.spills.for.combine property.