
What is the control flow of Hadoop mapper with MultipleInputs?

Objective: To implement Reduce Side Join

I currently have job chaining (two jobs) in my code. Now I want to implement a reduce-side join with another job, and I have to take multiple inputs:

Input #1: Output from the previous reducer.
Input #2: New file from HDFS to implement join.

I saw some articles on how to use MultipleInputs.addInputPath(job, path, InputFormat.class, Mapper.class);

So I understand that I have to use it twice, once for Input #1 and once for Input #2.
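
For reference, a minimal driver sketch of that double call might look like the following (the mapper, reducer and path names are placeholders, not from the post):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(JoinDriver.class);

        // Input #1: output of the previous reducer, handled by its own mapper class
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, PrevOutputMapper.class);
        // Input #2: the new HDFS file to join against, handled by a second mapper class
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, JoinFileMapper.class);

        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}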

Question 1: If I use two separate mappers and a single reducer, which mapper will be executed first (or will they be executed in parallel)? How can I check on the reducer side which mapper emitted a given <key, value> pair?

Question 2: If I use a single mapper and a single reducer, what will be the control flow?

Question 3: More of a hack, i.e. not using MultipleInputs.
Is it OK (performance-wise) to use DistributedCache to load Input #2 in the reducer's setup() method, and to take the output from the previous reducer as the only input for the job?

Note: The Input #2 file is pretty small in size.

Answer 1:
Map tasks for both mappers should run in parallel, provided slots are available. If only a single slot is available they may run in sequence (with possible interleaving), but that is not the normal scenario. I'm not aware of any configuration that controls the order in which the mappers run.

Also, I doubt that any API is available to identify which mapper emitted a <key, value> pair. To be precise, only the value needs to be identified, since the same key may be emitted by different maps. This is usually achieved by adding a prefix tag to the output value and resolving those tags in the reducer, e.g.:

if (value.toString().startsWith("Input#1")) { /* processing code for Input #1 */ }

Have a look at this blog post; it has all the required tips and tricks. Note that those examples use the old mapred API, but the logic is the same in any case.
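
As a minimal sketch of that tag-and-resolve pattern with the new mapreduce API (the tag strings and the tab-separated layout are assumptions; each mapper is assumed to prefix its output value with its own tag followed by a tab):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> fromInput1 = new ArrayList<>();
        List<String> fromInput2 = new ArrayList<>();

        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("Input#1\t")) {           // emitted by the mapper for Input #1
                fromInput1.add(v.substring("Input#1\t".length()));
            } else if (v.startsWith("Input#2\t")) {    // emitted by the mapper for Input #2
                fromInput2.add(v.substring("Input#2\t".length()));
            }
        }

        // inner join: pair every Input #1 record with every Input #2 record for this key
        for (String left : fromInput1) {
            for (String right : fromInput2) {
                context.write(key, new Text(left + "\t" + right));
            }
        }
    }
}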

Answer 2:
Without MultipleInputs, in the map you have to identify the file name of an incoming pair using the available Context object, e.g.:

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    //............
}

Then just prepend a suitable tag to the output value; the rest is the same as in Answer 1.
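
For completeness, a sketch of how that tagging might look inside such a map() method (the file-name check and the tab-separated record layout are assumptions):

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    // tag records coming from the join file as Input#2, everything else as Input#1
    String tag = fileName.startsWith("join_file") ? "Input#2" : "Input#1";

    // assume the join key is the first tab-separated field of the record
    String[] fields = value.toString().split("\t", 2);
    String joinKey = fields[0];
    String payload = fields.length > 1 ? fields[1] : "";

    context.write(new Text(joinKey), new Text(tag + "\t" + payload));
}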

Answer 3:
This is tricky. Using DistributedCache for joining boosts performance when the file added to the cache is small, probably because the job then runs with fewer map tasks. With large files, however, it hurts performance. The dilemma is knowing how many bytes count as small for DistributedCache.

Since you mentioned that the Input #2 file is pretty small, this is probably the most suitable solution for you.
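
As a minimal sketch of that approach (the file path, the "input2" link name and the tab-separated layout are assumptions): on the driver side the small file would be shipped with something like job.addCacheFile(new URI("/user/me/join_input2.txt#input2")), and the reducer could then load it once in setup():

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CacheJoinReducer extends Reducer<Text, Text, Text, Text> {
    private final Map<String, String> joinTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // the cached file is localized next to the task and reachable via its link name
        try (BufferedReader reader = new BufferedReader(new FileReader("input2"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t", 2);   // assume a "key<TAB>value" layout
                joinTable.put(fields[0], fields.length > 1 ? fields[1] : "");
            }
        }
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String right = joinTable.get(key.toString());
        if (right == null) {
            return;                                      // no match in Input #2 -> drop the key
        }
        for (Text value : values) {
            context.write(key, new Text(value.toString() + "\t" + right));
        }
    }
}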

NB: Several parts of this post are (a little) opinion-based. Input from experts is welcome.
