
Why doesn't this Hadoop example that uses a Combiner class work properly? (the "local reduction" provided by the Combiner is not performed)

I am absolutely new to Hadoop and I am doing some experiments trying to use a Combiner class to perform the reduce operation locally on the same node as the mapper. I am using Hadoop 1.2.1.

So I have these 3 classes:

WordCountWithCombiner.java :

// Learning MapReduce by Nitesh Jain
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;

/*
 * Extend the Configured class to inherit the configuration handling, and
 * implement the Tool interface so that ToolRunner can parse the generic
 * command line options (e.g. -D properties) for us.
 */
public class WordCountWithCombiner extends Configured implements Tool{

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf(); 
    
    Job job = new Job(conf, "MyJob");   // Job is a "dashboard" with levers to control the execution of the job
    
    job.setJarByClass(WordCountWithCombiner.class);             // Name of the driver class inside the jar
    job.setJobName("Word Count With Combiners");    // Set the name of the job

    FileInputFormat.addInputPath(job, new Path(args[0]));           // The input file is the first parameter of the main() method
    FileOutputFormat.setOutputPath(job, new Path(args[1]));         // The output file is the second parameter of the main() method
    
    job.setMapperClass(WordCountMapper.class);          // Set the mapper class
    
    /* Set the combiner: the combiner is a reducer performed locally on the same mapper node (we are reusing the previous
     * WordCountReducer class because it performs the same task, just locally to the mapper):
     */
    job.setCombinerClass(WordCountReducer.class);
    job.setReducerClass(WordCountReducer.class);        // Set the reducer class

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    
    return job.waitForCompletion(true) ? 0 : 1;
   
   }
  
  public static void main(String[] args) throws Exception {
    /* The ToolRunner object is used to trigger the run() method, which contains all the batch execution logic.
     * It gives us the ability to set configuration properties at run time, so we do not have to write a single line of code to handle them.
     */
    int exitCode = ToolRunner.run(new Configuration(), new WordCountWithCombiner(), args);
    System.exit(exitCode);
}

}

WordCountMapper.java :

// Learning MapReduce by Nitesh J.
// Word Count Mapper. 
import java.io.IOException;
import java.util.StringTokenizer;

// Import KEY AND VALUES DATATYPE:
import org.apache.hadoop.io.IntWritable;    // Similar to Int
import org.apache.hadoop.io.LongWritable;   // Similar to Long
import org.apache.hadoop.io.Text;           // Similar to String

import org.apache.hadoop.mapreduce.Mapper;

/* Every mapper class extends the Hadoop Mapper class.
 * Type parameters:
 *   input key    (the byte offset of the line, a LongWritable)
 *   input value  (the line of text, a Text)
 *   output key   (the word, a Text)
 *   output value (the count for that word, an IntWritable)
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  /* Override the map() function defined by the extended Mapper class:
   * the input parameters have to match those declared in the Mapper class.
   * @param context: used to emit the output <key, value> pairs.
   *
   * Tokenize the line into words and write each word into the context,
   * with the word as key and one (1) as value.
   */
  @Override
  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    
      
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
    
      while (itr.hasMoreTokens()) {
          // just added the line below to convert everything to lower case
          word.set(itr.nextToken().toLowerCase());
          // the following check ensures that the word starts with a letter
          if(Character.isAlphabetic((word.toString().charAt(0)))){
              context.write(word, one);
          }
    }
  }

}

WordCountReducer.java :

// Learning MapReduce by Nitesh Jain
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/* Every reducer class has to extend the Hadoop Reducer class.
 * Type parameters:
 *   the mapper output key    (Text, the word)
 *   the mapper output value  (the number of occurrences of the related word: 1)
 *   the reducer output key   (the word)
 *   the reducer output value (the total number of occurrences of the related word)
 * The first two have to match the Mapper's output types.
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    
    /*
     * I have to override the reduce() function defined by the extended Reducer class
     * @param key: the current word
     * @param Iterable<IntWritable> values: the input of the reduce() function is a key and the list of values associated with that key
     * @param context: collects the output <key, values> pairs
     */
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
        
        int sum = 0;
        for (IntWritable value : values) {
          sum += value.get();
        }
        context.write(key, new IntWritable(sum));
      }

}

As you can see in the WordCountWithCombiner driver class, I have set the WordCountReducer class as the combiner, to perform a reduction directly on the mapper node, with this line:

job.setCombinerClass(WordCountReducer.class);

Then I have this input file on the Hadoop File System:

andrea@andrea-virtual-machine:~/workspace/HadoopExperiment/bin$ hadoop fs -cat  in
to be or not to be

And I want to operate on it.

If I run the previous job in the classical way, going through both the map and reduce phases, it works fine. In fact, running this command in the Linux shell:

andrea@andrea-virtual-machine:~/workspace/HadoopExperiment/bin$ hadoop jar WordCount.jar WordCountWithCombiner in out6

Hadoop does its work and then I obtain the expected result:

andrea@andrea-virtual-machine:~/workspace/HadoopExperiment/bin$ hadoop fs -cat  out6/p*
be  2
not 1
or  1
to  2
andrea@andrea-virtual-machine:~/workspace/HadoopExperiment/bin$ 

Ok, it works fine.

The problem is that now I don't want to perform the reduce phase, and I expect the same result, because I have set a combiner that does the same thing as the reducer, just on the mapper's node.

So, in the Linux shell I run this command, which excludes the reduce phase:

hadoop jar WordCountWithCombiner.jar WordCountWithCombiner -D mapred.reduce.tasks=0 in out7
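
The reduce phase could also be disabled directly in the driver instead of on the command line; a minimal sketch using the standard Job API (this is just an alternative I assume would behave the same, not something shown in the run below):

// In WordCountWithCombiner.run(), before calling job.waitForCompletion():
// with zero reduce tasks the job becomes map-only and the mapper output
// is written straight to HDFS.
job.setNumReduceTasks(0);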

But it doesn't work as expected; this is what I obtain (I post the entire output to give more information about what is happening):

andrea@andrea-virtual-machine:~/workspace/HadoopExperiment/bin$ hadoop jar WordCountWithCombiner.jar WordCountWithCombiner -D mapred.reduce.tasks=0 in out7
16/02/13 19:43:44 INFO input.FileInputFormat: Total input paths to process : 1
16/02/13 19:43:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
16/02/13 19:43:44 WARN snappy.LoadSnappy: Snappy native library not loaded
16/02/13 19:43:45 INFO mapred.JobClient: Running job: job_201601242121_0008
16/02/13 19:43:46 INFO mapred.JobClient:  map 0% reduce 0%
16/02/13 19:44:00 INFO mapred.JobClient:  map 100% reduce 0%
16/02/13 19:44:05 INFO mapred.JobClient: Job complete: job_201601242121_0008
16/02/13 19:44:05 INFO mapred.JobClient: Counters: 19
16/02/13 19:44:05 INFO mapred.JobClient:   Job Counters 
16/02/13 19:44:05 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=18645
16/02/13 19:44:05 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
16/02/13 19:44:05 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
16/02/13 19:44:05 INFO mapred.JobClient:     Launched map tasks=1
16/02/13 19:44:05 INFO mapred.JobClient:     Data-local map tasks=1
16/02/13 19:44:05 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
16/02/13 19:44:05 INFO mapred.JobClient:   File Output Format Counters 
16/02/13 19:44:05 INFO mapred.JobClient:     Bytes Written=31
16/02/13 19:44:05 INFO mapred.JobClient:   FileSystemCounters
16/02/13 19:44:05 INFO mapred.JobClient:     HDFS_BYTES_READ=120
16/02/13 19:44:05 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=55503
16/02/13 19:44:05 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=31
16/02/13 19:44:05 INFO mapred.JobClient:   File Input Format Counters 
16/02/13 19:44:05 INFO mapred.JobClient:     Bytes Read=19
16/02/13 19:44:05 INFO mapred.JobClient:   Map-Reduce Framework
16/02/13 19:44:05 INFO mapred.JobClient:     Map input records=1
16/02/13 19:44:05 INFO mapred.JobClient:     Physical memory (bytes) snapshot=93282304
16/02/13 19:44:05 INFO mapred.JobClient:     Spilled Records=0
16/02/13 19:44:05 INFO mapred.JobClient:     CPU time spent (ms)=2870
16/02/13 19:44:05 INFO mapred.JobClient:     Total committed heap usage (bytes)=58195968
16/02/13 19:44:05 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=682741760
16/02/13 19:44:05 INFO mapred.JobClient:     Map output records=6
16/02/13 19:44:05 INFO mapred.JobClient:     SPLIT_RAW_BYTES=101
andrea@andrea-virtual-machine:~/workspace/HadoopExperiment/bin$ hadoop fs -cat  out7/p*
to  1
be  1
or  1
not 1
to  1
be  1

So as you can see, it seems that the local reduction provided by the Combiner did not take place.

Why? What am I missing? How can I try to solve this issue?

Do not assume that the combiner will run. Treat the combiner only as an optimization. The Combiner is not guaranteed to run over all of your data. In some cases, when the data doesn't need to be spilled to disk, MapReduce will skip using the Combiner entirely. Note also that the Combiner may run multiple times over subsets of the data: it runs once per spill.

Hence, setting the number of reducers to 0 does not mean you will get the same (reduced) result, because not all of the mapper's data is guaranteed to be covered by the Combiner.
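
If you want to verify whether the combiner actually ran, inspect the job's combine counters after completion. A minimal sketch for the driver (the counter group and names below are my assumption for Hadoop 1.x and may differ in other versions):

// import org.apache.hadoop.mapreduce.Counters;
// After job.waitForCompletion(true), check how many records the combiner processed.
Counters counters = job.getCounters();
long combineIn  = counters.findCounter("org.apache.hadoop.mapred.Task$Counter",
                                       "COMBINE_INPUT_RECORDS").getValue();
long combineOut = counters.findCounter("org.apache.hadoop.mapred.Task$Counter",
                                       "COMBINE_OUTPUT_RECORDS").getValue();
System.out.println("Combine input records:  " + combineIn);
System.out.println("Combine output records: " + combineOut);
// If both are 0 the combiner never ran, for example because nothing was
// spilled to disk or because the reduce phase was disabled.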
