
Running code after all reducers run

I am trying to write a MapReduce program that computes the average of some statistics.

The mappers read the data in their respective segments and apply some filters.

I am using multiple reducers.

Therefore each reducer can only calculate the local average for its own partition. However, I need the average over all the data arriving at all the reducers. How do I pull this off?

One idea is to use global counters to hold the sum and the count. But then I need a piece of code that runs after all reducers have finished (so that I can operate on the final sum and count) and writes the average to a file. Is this a viable approach, and how can I do it?

Also note that I have to use multiple reducers, so the option of having just one reducer and computing the average in its cleanup method is off the table.

Option 1. Implement a combiner and use only one reducer. The combiner reduces the amount of data transferred to the reducer. If the reason for using more than one reducer is the volume of data you are processing, this could be an option.
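One caveat with this option: averages themselves are not associative, so the combiner should merge (sum, count) pairs rather than averages. A minimal sketch, assuming the mappers emit each value as a comma-separated `"sum,count"` `Text` pair (the class name, key scheme, and value encoding here are illustrative, not from the question):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Merges partial (sum, count) pairs; safe to run as a combiner because
// summing sums and summing counts is associative, unlike averaging.
public class SumCountCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (Text v : values) {                       // each value is "sum,count"
            String[] parts = v.toString().split(",");
            sum += Long.parseLong(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        ctx.write(key, new Text(sum + "," + count));
    }
}
```

You would register it in the driver with `job.setCombinerClass(SumCountCombiner.class)`; the single reducer then does one final merge and divides sum by count.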

Option 2. Inside each mapper, compute the partial sum/count in memory and write only the aggregated values to the output in the cleanup method. This lets you use a single reducer to compute the final average.
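This "in-mapper combining" pattern could look like the following sketch. It assumes one numeric value per input line and omits the question's filtering logic; the class name, key, and value encoding are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PartialAvgMapper extends Mapper<LongWritable, Text, Text, Text> {
    private long sum = 0;    // partial aggregate held in memory
    private long count = 0;  // across all records this mapper sees

    @Override
    protected void map(LongWritable key, Text value, Context ctx) {
        // ... apply your filters here, then accumulate ...
        long v = Long.parseLong(value.toString().trim());
        sum += v;
        count++;
    }

    @Override
    protected void cleanup(Context ctx)
            throws IOException, InterruptedException {
        // One record per mapper task; a single reducer merges these
        // few records and divides the total sum by the total count.
        ctx.write(new Text("partial"), new Text(sum + "," + count));
    }
}
```

Since each mapper emits exactly one record, the single reducer only handles as many records as there are map tasks, so it is no longer a bottleneck.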

Option 3. Implement your process as two MapReduce jobs: one that calculates a partial sum/count in each reducer, followed by a second job with identity mappers and a single reducer that computes the average.

Option 4. Use counters and, as @Thomas suggests, implement the logic after waitForCompletion.
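A sketch of the driver side of this option, assuming the reducers increment two hypothetical counters via `context.getCounter("stats", "SUM").increment(value)` and `context.getCounter("stats", "COUNT").increment(1)` (the group and counter names are made up for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AvgDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "average");
        // ... set mapper, reducer, input/output paths as in your existing job ...

        // waitForCompletion(true) blocks until every reducer has finished,
        // so the counter totals read below are final.
        if (job.waitForCompletion(true)) {
            long sum = job.getCounters().findCounter("stats", "SUM").getValue();
            long count = job.getCounters().findCounter("stats", "COUNT").getValue();
            double avg = (double) sum / count;
            System.out.println("average = " + avg);
            // write avg to an HDFS file here if a file output is required
        }
    }
}
```

Note that counters only hold long values, so if your statistics are fractional you would need to scale them to integers before incrementing.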

Option 5. Compute the average from the reducers' output by reading the HDFS files (using counters may be simpler).

In my opinion, Option 2 is the simplest and cleanest to implement. Option 1 is the most generic, useful if you need to calculate more than one average at the same time and holding the sums/counts in memory is not possible (counters are more restrictive: only some thousands of them are available).

If you insist on using multiple reducers for this job, then I guess you should build a chain of multiple (in your case, two) jobs. The first job does whatever you have right now; the second job is set up to calculate the overall average, so the first job's output becomes the second job's input.

You can see my answer here for how to set up a chain of jobs in a single driver class.
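A sketch of such a two-job driver, assuming the first job emits partial (sum, count) records and a second single-reducer job merges them (the class names, path layout, and reducer count are illustrative; plug in your existing mapper/reducer classes where indicated):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedAvgDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1] + "/partials");
        Path output = new Path(args[1] + "/average");

        // Job 1: your current filters plus per-partition (sum, count),
        // running with as many reducers as you need.
        Job first = Job.getInstance(conf, "partial-sums");
        // first.setMapperClass(...); first.setReducerClass(...);
        first.setNumReduceTasks(4); // illustrative count
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) System.exit(1);

        // Job 2: identity map (the default Mapper), one reducer that
        // sums all partials and divides to get the overall average.
        Job second = Job.getInstance(conf, "final-average");
        // second.setReducerClass(...);
        second.setNumReduceTasks(1);
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```

The second job's single reducer is cheap here because it only processes one record per reducer of the first job, not the raw data.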

