
Hadoop combiner execution on reducers

Original post: 2015-05-05 14:58:04. Tags: hadoop, mapreduce, aggregation, reducers, combiners

I have a long-running MapReduce job in which some mappers take considerably more time than others.

Checking the stats on the web interface, I saw that my combiner also kicked in on the reducers (which were mostly idle, as just 2 mappers were still running).

Although it seems reasonable not to waste time and to do some pre-aggregation until all mappers have finished, I cannot find any documentation for this behaviour. Can anyone confirm whether this is indeed a feature of Hadoop or whether it is just displayed incorrectly on the web interface?

1 answer

The combiner starts when a reasonable amount of data has been emitted by the mapper. Note that a combiner (typically) runs as an aggregation of a mapper's output (and not on the reduce side). More details can be found here.
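To make the idea of map-side pre-aggregation concrete, here is a minimal plain-Java sketch. It does not use the Hadoop API: the class name `CombinerSketch`, the record-count threshold `SPILL_THRESHOLD`, and the summing combiner are all hypothetical and chosen for illustration only (real Hadoop spills based on the size of an in-memory sort buffer, not on a record count).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: simulates a mapper whose output is combined before each spill.
class CombinerSketch {
    // Hypothetical threshold: combine and spill after this many buffered records.
    static final int SPILL_THRESHOLD = 4;

    private final List<Map.Entry<String, Integer>> buffer = new ArrayList<>();
    private final List<Map<String, Integer>> spills = new ArrayList<>();

    // Called for every (key, value) pair the mapper emits.
    void emit(String key, int value) {
        buffer.add(new AbstractMap.SimpleEntry<>(key, value));
        if (buffer.size() >= SPILL_THRESHOLD) {
            spill();
        }
    }

    // The "combiner": sum the buffered values per key, then store the result as one spill.
    void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> record : buffer) {
            combined.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        spills.add(combined);
        buffer.clear();
    }

    // Flush whatever is left in the buffer and return all spills.
    List<Map<String, Integer>> finish() {
        spill();
        return spills;
    }

    public static void main(String[] args) {
        CombinerSketch mapper = new CombinerSketch();
        for (String word : "a b a a b c".split(" ")) {
            mapper.emit(word, 1);
        }
        // Six emitted records shrink to four combined records in two spills.
        System.out.println(mapper.finish()); // prints [{a=3, b=1}, {b=1, c=1}]
    }
}
```

The reducer still has to merge the partial sums for keys that appear in more than one spill (here `b`), which is why a combiner must be safe to apply zero, one, or several times.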

Also, the reducers can start gathering the data already emitted by the mappers (and only that data) before all the mappers have finished. This is called the shuffle phase of the reducer. You can change when the reducers start gathering data by changing the mapred.reduce.slowstart.completed.maps property (or mapreduce.job.reduce.slowstart.completedmaps in newer versions). More details on this SO post.
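The slowstart property is a fraction between 0 and 1 of the map tasks that must complete before reducers are scheduled. The following plain-Java sketch shows how such a threshold plays out for a job. The class and method names are hypothetical, and rounding the threshold up with `Math.ceil` is an assumption made for illustration, not a statement about Hadoop's exact scheduler logic.

```java
// Illustration only: models a slowstart fraction as a threshold on completed map tasks.
class SlowstartSketch {
    // Number of map tasks that must finish before reducers may start shuffling.
    // Assumption: the fraction is applied to the total map count and rounded up.
    static int mapsRequiredBeforeReducers(double slowstartFraction, int totalMaps) {
        return (int) Math.ceil(slowstartFraction * totalMaps);
    }

    // True once enough map tasks have completed for reducers to be scheduled.
    static boolean reducersMayStart(double slowstartFraction, int totalMaps, int completedMaps) {
        return completedMaps >= mapsRequiredBeforeReducers(slowstartFraction, totalMaps);
    }

    public static void main(String[] args) {
        int totalMaps = 40;
        // With a fraction of 0.5, reducers may start once half of the maps are done.
        System.out.println(reducersMayStart(0.5, totalMaps, 19));
        System.out.println(reducersMayStart(0.5, totalMaps, 20));
        // With a fraction of 1.0, reducers wait for every map, including slow stragglers.
        System.out.println(reducersMayStart(1.0, totalMaps, 38));
    }
}
```

In the situation from the question, where only 2 mappers are still running, any fraction below 1.0 would typically already have let the reducers start shuffling, which matches the mostly idle reducers seen on the web interface.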