Hadoop combiner sort phase

Question

When running a MapReduce job with a specified combiner, is the combiner run during the sort phase? I understand that the combiner is run on mapper output for each spill, but it seems like it would also be beneficial to run it during the intermediate steps of the merge sort. I'm assuming here that in some stages of the sort, mapper output for some equivalent keys is held in memory at some point.

If this doesn't currently happen, is there a particular reason, or is it just something that hasn't been implemented?

Thanks in advance!

Answer 1

Combiners are there to save network bandwidth.

The map output gets sorted directly:

    sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);

This happens right after the actual mapping is done. While iterating through the buffer, it checks whether a combiner has been set and, if so, combines the records. If not, it spills directly to disk.

The important parts are in MapTask, if you'd like to see it for yourself.

    sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
    // some fields
    for (int i = 0; i < partitions; ++i) {
        // check whether a combiner is configured
        if (combinerRunner == null) {
            // no combiner: spill the sorted records directly
        } else {
            // combiner set: combine the records before they are spilled
            combinerRunner.combine(kvIter, combineCollector);
        }
    }

This is the right stage to save disk space and network bandwidth, because it is very likely that the output has to be transferred. During the merge/shuffle/sort phase it is not beneficial, because you then have to process more data than you would with the combiner run at map finish time.

Note that the sort phase shown in the web interface is misleading. It is just pure merging.
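
To make the sort-then-combine step concrete, here is a minimal sketch in plain Java. It does not use Hadoop's classes: `SpillCombineSketch`, `rec`, and `sortAndCombine` are hypothetical names, and the combiner is assumed to be a simple sum over values, as in word count. It only models what is described above: the buffer is sorted by key, then records with equal keys are combined before the spill would be written.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustrative model of a single spill; none of these names come from Hadoop.
class SpillCombineSketch {

    // Helper to build a (key, count) record.
    static Map.Entry<String, Integer> rec(String key, int value) {
        return new SimpleEntry<String, Integer>(key, value);
    }

    // Sorts the buffered records by key, then sums each run of equal keys.
    // The summing "combiner" is an assumption made for this example.
    static List<Map.Entry<String, Integer>> sortAndCombine(
            List<Map.Entry<String, Integer>> buffer) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(buffer);
        sorted.sort(Map.Entry.comparingByKey()); // the in-memory sort

        List<Map.Entry<String, Integer>> spill = new ArrayList<>();
        for (Map.Entry<String, Integer> next : sorted) {
            int last = spill.size() - 1;
            if (last >= 0 && spill.get(last).getKey().equals(next.getKey())) {
                // Same key as the previous record: fold it into that record.
                int sum = spill.get(last).getValue() + next.getValue();
                spill.set(last, rec(next.getKey(), sum));
            } else {
                spill.add(rec(next.getKey(), next.getValue()));
            }
        }
        return spill;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> buffer = Arrays.asList(
                rec("b", 1), rec("a", 1), rec("b", 1), rec("a", 1), rec("c", 1));
        // Five buffered records shrink to three before being spilled.
        System.out.println(sortAndCombine(buffer)); // prints [a=2, b=2, c=1]
    }
}
```

Because sorting places equal keys next to each other, the combine step is a single linear pass over the sorted buffer, which is why it is cheap to do at spill time.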

Answer 2

There are two opportunities for running the Combiner, both on the map side of processing. (A very good online reference is Tom White's "Hadoop: The Definitive Guide": https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort )

The first opportunity comes on the map side after completing the in-memory sort by key of each partition, and before writing the sorted data to disk. The motivation for running the Combiner at this point is to reduce the amount of data ultimately written to local storage. By running the Combiner here, we also reduce the amount of data that will need to be merged and sorted in the next step. So, to answer the original question: yes, the Combiner is already being applied at this early step.

The second opportunity comes right after merging and sorting the spill files. In this case, the motivation for running the Combiner is to reduce the amount of data ultimately sent over the network to the reducers. This stage benefits from the earlier application of the Combiner, which may have already reduced the amount of data to be processed by this step.
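
The second opportunity can be sketched the same way. The example below is plain Java rather than Hadoop code: `MergeCombineSketch`, `Cursor`, `mergeSpills`, and the `runCombiner` flag are hypothetical names, and the combiner is again assumed to be a sum. It merges several already-sorted spills with a priority queue and, if asked to, combines records whose keys meet during the merge.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative model of merging spill files; none of these names come from Hadoop.
class MergeCombineSketch {

    static Map.Entry<String, Integer> rec(String key, int value) {
        return new SimpleEntry<String, Integer>(key, value);
    }

    // Read position within one sorted spill.
    private static final class Cursor {
        final List<Map.Entry<String, Integer>> spill;
        int pos = 0;
        Cursor(List<Map.Entry<String, Integer>> spill) { this.spill = spill; }
        Map.Entry<String, Integer> head() { return spill.get(pos); }
    }

    // K-way merge of sorted spills; sums equal keys when runCombiner is true.
    static List<Map.Entry<String, Integer>> mergeSpills(
            List<List<Map.Entry<String, Integer>>> spills, boolean runCombiner) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                Comparator.comparing((Cursor c) -> c.head().getKey()));
        for (List<Map.Entry<String, Integer>> spill : spills) {
            if (!spill.isEmpty()) heap.add(new Cursor(spill));
        }

        List<Map.Entry<String, Integer>> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();               // spill with the smallest head key
            Map.Entry<String, Integer> next = c.head();
            c.pos++;
            if (c.pos < c.spill.size()) heap.add(c); // re-queue if records remain

            int last = merged.size() - 1;
            if (runCombiner && last >= 0
                    && merged.get(last).getKey().equals(next.getKey())) {
                int sum = merged.get(last).getValue() + next.getValue();
                merged.set(last, rec(next.getKey(), sum));
            } else {
                merged.add(rec(next.getKey(), next.getValue()));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> spills = Arrays.asList(
                Arrays.asList(rec("a", 2), rec("b", 1)),
                Arrays.asList(rec("a", 1), rec("c", 3)),
                Arrays.asList(rec("b", 4)));
        System.out.println(mergeSpills(spills, true)); // prints [a=3, b=5, c=3]
    }
}
```

Each spill was already combined on its own, so the merge only has to fold together keys that appear in more than one spill; with `runCombiner` set to false the same input yields five records instead of three.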

Answer 3

The combiner only runs the way you already understand it.

I suspect the reason the combiner only works this way is that it reduces the amount of data being sent to the reducers. This is a huge gain in many situations. Meanwhile, in the reducer, the data is already there, and whether you combine it in the sort/merge or in your reduce logic is not really going to matter computationally (it's either done now or later).

So, I guess my point is: you may get gains by combining in the merge as you suggest, but they won't be as large as those from the map-side combiner.

Answer 4

I haven't gone through the code, but Tom White's "Hadoop: The Definitive Guide" (3rd edition) does mention that if a combiner is specified, it will run during the merge phase in the reducer. The following is an excerpt from the text:

"The map outputs are copied to the reduce task JVM's memory if they are small enough (the buffer's size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of data written to disk."

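As a rough illustration of how the three properties in the excerpt interact, here is a small plain-Java sketch. It is a simplified reading of the quoted text, not Hadoop's actual code: `ShuffleBufferSketch` and its methods are hypothetical, and the values 0.70, 0.66, and 1000 used in `main` are assumed settings for the example only.

```java
// Simplified model of the reduce-side in-memory merge trigger described in the excerpt.
// Class and method names are hypothetical; this is not Hadoop's implementation.
class ShuffleBufferSketch {

    // Bytes of heap set aside for map outputs
    // (models mapred.job.shuffle.input.buffer.percent).
    static long inputBufferBytes(long heapBytes, double inputBufferPercent) {
        return (long) (heapBytes * inputBufferPercent);
    }

    // Buffer fill level at which a merge starts
    // (models mapred.job.shuffle.merge.percent applied to the buffer above).
    static long mergeTriggerBytes(long heapBytes, double inputBufferPercent,
                                  double mergePercent) {
        return (long) (inputBufferBytes(heapBytes, inputBufferPercent) * mergePercent);
    }

    // A merge (and, if configured, a combiner run) starts when either threshold is
    // reached: buffered bytes, or number of buffered map outputs
    // (models mapred.inmem.merge.threshold).
    static boolean shouldMerge(long bufferedBytes, int bufferedOutputs,
                               long triggerBytes, int inmemMergeThreshold) {
        return bufferedBytes >= triggerBytes || bufferedOutputs >= inmemMergeThreshold;
    }

    public static void main(String[] args) {
        long heap = 1024L * 1024L * 1024L; // assume a 1 GiB reduce task heap
        long trigger = mergeTriggerBytes(heap, 0.70, 0.66); // assumed settings
        System.out.println("merge trigger (bytes): " + trigger);
        System.out.println("merge now? " + shouldMerge(trigger, 10, trigger, 1000));
    }
}
```

The point of the sketch is that the combiner's reduce-side run is tied to these merges: it only gets a chance to run when one of the two thresholds causes the in-memory buffer to be merged and spilled.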
4 answers

Answer 1: score 14, accepted, 2011-10-19 18:35:32
Answer 2: score 3, 2014-02-23 22:57:25
Answer 3: score 2, 2011-10-19 18:36:49
Answer 4: score 0, 2012-12-20 05:46:49
