
MapReduce Job distribution among reducers

I developed a small MapReduce program. When I opened the process log, I saw that the framework created one map task and two reduce tasks. I had only one input file and got two output files. Now please tell me:

1) Is the number of mappers and reducers decided by the framework, or can it be changed?
2) Is the number of output files always equal to the number of reducers, i.e. does each reducer
   create its own output file?
3) How is one input file distributed among mappers? And is the output of one mapper
   distributed among multiple reducers (is this done by the framework, or can you change it)?
4) How do I manage multiple input files, i.e. a directory
   containing input files?

Please answer these questions. I am a beginner to MapReduce.

Let me attempt to answer your questions. Please point out anything you think is incorrect.

1) Is the number of mappers and reducers decided by the framework, or can it be changed?

The total number of map tasks depends on the number of logical splits made out of the input's HDFS blocks. So fixing the number of map tasks is not always possible: different files have different sizes and therefore different numbers of blocks. With TextInputFormat, each logical split roughly corresponds to one block, so the total number of map tasks follows from however many blocks the input files happen to have.

Unlike the number of mappers, the number of reducers can be fixed.
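To make the two cases concrete, here is a minimal driver sketch; the class name TaskCountDemo, the reducer count of 4 and the 64 MB/128 MB split bounds are arbitrary example values, not anything Hadoop prescribes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class TaskCountDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "task-count-demo");

            // The number of reduce tasks can be fixed directly:
            job.setNumReduceTasks(4);

            // The number of map tasks cannot be fixed, but the logical split size
            // that determines it can be bounded (one map task per split):
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB
        }
    }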

2) Is the number of output files always equal to the number of reducers, i.e. does each reducer create its own output file?

To a certain degree, yes, but there are ways to create more than one output file from a single reducer, e.g. MultipleOutputs.
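As a rough sketch of how MultipleOutputs can be used in a reducer (the class name RareWordReducer, the named output "rare" and the rule of separating words that occur only once are made-up illustrations):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class RareWordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private MultipleOutputs<Text, IntWritable> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            if (sum == 1) {
                mos.write("rare", key, new IntWritable(sum)); // lands in rare-r-0000x files
            } else {
                context.write(key, new IntWritable(sum));     // normal part-r-0000x files
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }

The named output also has to be declared in the driver, e.g. MultipleOutputs.addNamedOutput(job, "rare", TextOutputFormat.class, Text.class, IntWritable.class); its files then appear alongside the usual part-r-* files, so one reducer produces more than one output file.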

3) How is one input file distributed among mappers? And is the output of one mapper distributed among multiple reducers (is this done by the framework, or can you change it)?

Each file in HDFS is composed of blocks. Those blocks are replicated and can reside on multiple nodes (machines). Map tasks are then scheduled to run on these blocks. The level of concurrency with which map tasks can run depends on the number of processors each machine has. E.g. if 10,000 map tasks are scheduled for a file, then depending on the total number of processors across the cluster, only about 100 may run concurrently at a time.

By default Hadoop uses the HashPartitioner, which calculates the hash code of each key sent from the Mapper to the framework and converts it to a partition.

Eg:

  public int getPartition(K2 key, V2 value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

As you can see above, a partition is selected from the (fixed) total number of reducers based on the hash code. So if numReduceTasks = 4, the value returned will be between 0 and 3.
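The question also asks whether you can change this distribution: yes, by plugging in your own Partitioner and registering it with job.setPartitionerClass(...). A minimal sketch (the class name VowelPartitioner and the vowel rule are made-up examples, not anything Hadoop prescribes):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Sends words that start with a vowel to reducer 0 and spreads all other
    // words over the remaining reducers by hash code.
    public class VowelPartitioner extends Partitioner<Text, IntWritable> {

        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            String word = key.toString();
            if (numReduceTasks <= 1 || word.isEmpty()) {
                return 0;
            }
            if ("aeiouAEIOU".indexOf(word.charAt(0)) >= 0) {
                return 0;
            }
            return 1 + (word.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
        }
    }

It is enabled in the driver with job.setPartitionerClass(VowelPartitioner.class); everything else about the shuffle is still handled by the framework.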

4) How do I manage multiple input files, i.e. a directory containing input files?

Hadoop supports a directory consisting of multiple files as input to a job.
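For example, a driver fragment (the paths are placeholders and it assumes a Job object named job, as in a normal driver; setInputDirRecursive exists in newer Hadoop 2.x releases and is only needed when the directory has nested sub-directories):

    // Point the job at a directory: every file inside it becomes job input.
    FileInputFormat.addInputPath(job, new Path("/user/hadoop/input_dir"));

    // Several paths can be added, either one call per path...
    FileInputFormat.addInputPath(job, new Path("/user/hadoop/more_input"));
    // ...or as a comma-separated list:
    FileInputFormat.addInputPaths(job, "/user/hadoop/jan,/user/hadoop/feb");

    // To also pick up files inside nested sub-directories:
    FileInputFormat.setInputDirRecursive(job, true);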

As explained by 'SSaikia_JtheRocker', mapper tasks are created according to the total number of logical splits of the HDFS blocks. I would like to add something to question #3: "How is one input file distributed among mappers? And is the output of one mapper distributed among multiple reducers (is this done by the framework, or can you change it)?" For example, consider my word count program, which counts the number of words in a file, shown below:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // value = "How Are You"
            String line = value.toString(); // convert Hadoop's Text into a Java String
            StringTokenizer tokenizer = new StringTokenizer(line); // tokens: "How", "Are", "You"
            while (tokenizer.hasMoreTokens()) {
                value.set(tokenizer.nextToken()); // value is overwritten with "How", then "Are", ...
                context.write(value, new IntWritable(1)); // emits (How, 1), (Are, 1), (You, 1)
            }
            // map() is called once for every line of the input split
        }
    }

So in the above program, the line "How are you" is split into 3 words by StringTokenizer. Note that the map() method itself is called once per input line, not once per word; within that single call the while loop runs three times, so three (word, 1) pairs are written to the context.
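For completeness, the matching reducer for such a word count simply sums the 1s per word. A minimal sketch (the class name WCReducer is assumed here, not from the original post):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // The framework groups the mapper output by key, so for the word "How"
            // the values iterable contains every 1 emitted for "How".
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum)); // e.g. (How, 1)
        }
    }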

As for reducers, we can specify how many reducers we want the output to be generated by, using the job.setNumReduceTasks(5) statement. The code snippet below gives you an idea:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class BooksMain {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Use programArgs array to retrieve program arguments.
            String[] programArgs = new GenericOptionsParser(conf, args)
                    .getRemainingArgs();
            Job job = new Job(conf);
            job.setJarByClass(BooksMain.class);
            job.setMapperClass(BookMapper.class);
            job.setReducerClass(BookReducer.class);
            job.setNumReduceTasks(5);
            // job.setCombinerClass(BookReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // TODO: Update the input path for the location of the inputs of the map-reduce job.
            FileInputFormat.addInputPath(job, new Path(programArgs[0]));
            // TODO: Update the output path for the output directory of the map-reduce job.
            FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));
            // Submit the job and wait for it to finish.
            job.waitForCompletion(true);
            // Submit and return immediately:
            // job.submit();
        }
    }
