Hadoop WordCount Combiner

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code

In the word count example, the reduce function is used as both the combiner and the reducer.

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

I understand how the reducer works, but in the case of the combiner, suppose my input is

  <Java,1> <Virtual,1> <Machine,1> <Java,1>

Does the combiner consider the first kv-pair and emit the same output, since it has only one value? How can it consider both keys and produce

  <Java,1,1>

if we are processing one kv-pair at a time? I know this is a false assumption; could someone please correct me on this?

The IntSumReducer class extends the Reducer class, and it is the Reducer class that does the magic here. If we look into the documentation:

"Reduces a set of intermediate values which share a key to a smaller set of values. Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.

Reducer has 3 primary phases:

Shuffle: The Reducer copies the sorted output from each Mapper using HTTP across the network.

Sort: The framework merge-sorts Reducer inputs by keys (since different Mappers may have output the same key).

The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged."

The program sets the same class for both the combine and reduce operations:

job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);

So what I figured out is that if we are using only one data node, we don't necessarily need to set the combiner class for this wordcount program, since the reducer class itself takes care of the combiner's job.

job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);

The above configuration has the same effect on the wordcount program if you are using only one data node.
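What happens between map and combine can be sketched in plain Java with no Hadoop dependency. The grouping step below is a simplified stand-in for Hadoop's sort phase; the class and method names are made up for illustration, and the summing logic mirrors IntSumReducer.reduce:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSimulation {

    // Group map output by key (a simplified stand-in for Hadoop's sort
    // phase), then apply the combiner logic: sum each key's values.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        // grouped for the sample input: {Java=[1, 1], Virtual=[1], Machine=[1]}
        // This grouping is why the combiner never sees one kv-pair at a time.

        Map<String, Integer> combined = new LinkedHashMap<>();
        grouped.forEach((key, values) ->
                combined.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return combined;
    }

    public static void main(String[] args) {
        // Map output for the question's input:
        // <Java,1> <Virtual,1> <Machine,1> <Java,1>
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("Java", 1),
                Map.entry("Virtual", 1),
                Map.entry("Machine", 1),
                Map.entry("Java", 1));

        System.out.println(combine(mapOutput)); // prints {Java=2, Virtual=1, Machine=1}
    }
}
```

The key point the sketch demonstrates: by the time the combiner (or reducer) runs, the framework has already grouped all values for a key together, so reduce(key, values, ...) receives (Java, [1, 1]), not two separate (Java, 1) pairs.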

The combiner combines the mapper output first, before it is sent to the reducer.

A mapper on a host may output many kv-pairs with the same key. The combiner merges that map output before it is sent to the reducer, thereby reducing the shuffle cost between mapper and reducer.

So if a mapper outputs (key, 1) (key, 1), the framework first groups these into (key, [1, 1]), and the combiner then reduces them to (key, 2).
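The shuffle saving is easy to quantify for the question's sample input. A minimal plain-Java sketch, assuming the combiner runs once over the whole map output of the mapper (Hadoop does not guarantee how many times a combiner runs, so treat the "with combiner" count as a best case):

```java
import java.util.HashSet;
import java.util.List;

public class ShuffleCost {
    public static void main(String[] args) {
        // Keys of the map output <Java,1> <Virtual,1> <Machine,1> <Java,1>
        List<String> mapOutputKeys = List.of("Java", "Virtual", "Machine", "Java");

        // Without a combiner, every map output record crosses the network.
        int withoutCombiner = mapOutputKeys.size();

        // With a combiner, the mapper ships at most one (already summed)
        // record per distinct key it produced.
        int withCombiner = new HashSet<>(mapOutputKeys).size();

        System.out.println(withoutCombiner + " records shuffled without combiner"); // 4
        System.out.println(withCombiner + " records shuffled with combiner");       // 3
    }
}
```

Here the saving is small (4 records down to 3), but for real inputs with heavily repeated words the combiner can shrink the shuffled data dramatically.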

The combiner runs on the map output. In your case the map output is

<Java,1> <Virtual,1> <Machine,1> <Java,1>

The framework sorts and groups this output by key before invoking the combiner, so the combiner runs once per key. Java is present two times, so the combiner is handed (Java, [1, 1]), i.e. a key with all of its values grouped together, and sums it to (Java, 2).
