
How the Hadoop combiner operates

Posted 2013-10-24 12:54:16 | Tags: hadoop, mapreduce, combiners

I have a question about how the combiner works in the Hadoop Map/Reduce framework. Is the combiner applied only to the key-value pairs output by a single map task, or to the output of all map tasks running on a given node? I have run some tests, and it seems to be the former. If I'm right, why was this behavior chosen, given that combining the output of all map tasks on a node could be very beneficial for reducing bandwidth use?

Thanks in advance.

1 answer

How would it know when all the map tasks are complete? The TaskTracker doesn't know how the JobTracker will assign map tasks, so you would probably have to wait for all the map tasks on the node to complete before running the combiners.
You also want to keep the data flowing between mappers and reducers. As combiners run and output is created, that data starts getting shuffled to the reducers right away (unless the slowstart setting is configured to a high value). This is good because it spreads the network utilization out over time.
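
To make the trade-off concrete, here is a toy simulation in plain Python (not Hadoop code). It assumes a word-count style combiner that sums values per key. The three map task outputs and their finish times are made up for illustration. It also simplifies real behavior: Hadoop may run the combiner zero or more times on a task's output, whereas this sketch runs it exactly once per task.

```python
from collections import Counter


def combine(pairs):
    """Sum the values per key, as a word-count style combiner would."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Hypothetical output of three map tasks that ran on the same node,
# each paired with a made-up time (in seconds) at which the task finished.
map_tasks = [
    (10, [("a", 1), ("b", 1), ("a", 1)]),
    (25, [("a", 1), ("c", 1)]),
    (60, [("b", 1), ("b", 1), ("c", 1)]),
]

# Per-task combining: each task's output is combined on its own and
# is available for shuffling as soon as that task finishes.
per_task = [(finished, combine(pairs)) for finished, pairs in map_tasks]
per_task_records = sum(len(output) for _, output in per_task)
per_task_first_ready = min(finished for finished, _ in per_task)

# Per-node combining: nothing can be emitted until the last task on
# the node has finished, because later tasks may add to any key.
all_pairs = [pair for _, pairs in map_tasks for pair in pairs]
per_node = combine(all_pairs)
per_node_records = len(per_node)
per_node_first_ready = max(finished for finished, _ in map_tasks)

print("per task:", per_task_records, "records, first output at t =", per_task_first_ready)
print("per node:", per_node_records, "records, first output at t =", per_node_first_ready)
```

In this toy input, per-node combining sends fewer records (3 instead of 6), but its first output is only available once the slowest task is done (t = 60 instead of t = 10). That is the tension the answer describes: fewer bytes overall versus network traffic spread out over time.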