
How do I run two different mappers on the same input and have their output sent to a single reducer?

I have some flight data (each line contains origin, destination, flight number, etc.) and I need to process it to output flight details between all origins and destinations with one stopover. My idea is to have two mappers (one outputs destination as key, the other outputs origin as key), so that the reducer gets the stopover location as key and all origins and destinations as the array of values. Then I can output flight details with one stopover for all locations in the reducer.

So my question is: how do I run two different mappers on the same input file and have their output sent to one reducer?

I read about MultipleInputs.addInputPath, but I guess it needs the inputs to be different (or at least two copies of the same input).
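For reference, this is how I understand MultipleInputs would be wired up, with two copies of the same file and two mapper classes (all paths and class names here are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class StopoverDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "two-mappers-one-reducer");
            job.setJarByClass(StopoverDriver.class);

            // One mapper keys records by origin, the other by destination;
            // each reads its own copy of the same flight data.
            // OriginMapper, DestinationMapper, and StopoverReducer are placeholders.
            MultipleInputs.addInputPath(job, new Path("/flights/copy1"),
                    TextInputFormat.class, OriginMapper.class);
            MultipleInputs.addInputPath(job, new Path("/flights/copy2"),
                    TextInputFormat.class, DestinationMapper.class);

            job.setReducerClass(StopoverReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job, new Path("/flights/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }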

I am thinking of running the two mapper jobs independently using a workflow, and then a third job with an identity mapper and a reducer where I will do the flight calculation.

Is there a better solution than this? (Please do not ask me to use Hive; I am not comfortable with it yet.) Any guidance on implementing this with MapReduce would really help. Thanks.

Your question did not specify whether you want to mix and match routes with and without stopovers.

So I will go ahead with the question as stated: that is, only consider exactly one stopover (not zero).

In that case, simply use two Map/Reduce stages. The first-stage mapper outputs:

(dest1, source1)

The first-stage reducer receives (dest1, Array(source1, source2, ...)).

The first-stage reducer then writes these tuples to an HDFS output directory.
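A minimal sketch of that first stage, assuming comma-separated lines laid out as origin,destination,flightno (the field order is an assumption; adjust the indices to your data):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Stage 1 mapper: key every flight record by its destination.
    class Stage1Mapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");    // assumed: origin,dest,flightno,...
            if (f.length < 2) return;                   // skip malformed lines
            ctx.write(new Text(f[1]), new Text(f[0]));  // emit (dest1, source1)
        }
    }

    // Stage 1 reducer: collect every source that reaches a given destination.
    class Stage1Reducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text dest, Iterable<Text> sources, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text src : sources) {
                if (sb.length() > 0) sb.append(',');
                sb.append(src.toString());
            }
            ctx.write(dest, new Text(sb.toString()));   // emit (dest1, "source1,source2,...")
        }
    }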

Now run the second stage: the second-stage mapper uses the stage-1 reducer output directory as its input.

The second-stage mapper reads:

(dest1, Array(source1, source2, ...))

The second-stage mapper outputs:

(dest2, (source1, dest1))

Then your final (stage-2) reducer receives:

(dest2, Array((source11, dest11), (source12, dest12), (source13, dest13), ...))

and it writes that data to the HDFS output. You can then use any external tools you like to read those results from HDFS.
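Chaining the two stages is just a driver that points the second job's input at the first job's output directory; a sketch (all paths are placeholders, Stage1Mapper/Stage1Reducer are from the sketch above, and the stage-2 classes are hypothetical stand-ins for your own):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoStageDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path stage1Out = new Path("/flights/stage1-out"); // intermediate directory

            Job job1 = Job.getInstance(conf, "stage1-by-dest");
            job1.setJarByClass(TwoStageDriver.class);
            job1.setMapperClass(Stage1Mapper.class);
            job1.setReducerClass(Stage1Reducer.class);
            job1.setOutputKeyClass(Text.class);
            job1.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job1, new Path("/flights/input"));
            FileOutputFormat.setOutputPath(job1, stage1Out);
            if (!job1.waitForCompletion(true)) System.exit(1);

            Job job2 = Job.getInstance(conf, "stage2-stopover");
            job2.setJarByClass(TwoStageDriver.class);
            job2.setMapperClass(Stage2Mapper.class);   // hypothetical, as described above
            job2.setReducerClass(Stage2Reducer.class); // hypothetical, as described above
            job2.setOutputKeyClass(Text.class);
            job2.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job2, stage1Out); // stage 1 output feeds stage 2
            FileOutputFormat.setOutputPath(job2, new Path("/flights/final-out"));
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }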

I think you can do it with just one Mapper.

The Mapper emits each (src,dst,fno,...) input record twice, once as (src,(src,dst,fno,...)) and once as (dst,(src,dst,fno,...)). In the Reducer you need to figure out, for each record, whether its key is a source or a destination, and do the stopover join. Using a flag to indicate the role of the key, plus a secondary sort, can make this a bit more efficient.
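A minimal sketch of that single-job approach, assuming the same comma-separated origin,destination,flightno layout (no secondary sort here; the reducer simply buffers the two roles in memory):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Emit every record twice: once keyed by origin, once by destination,
    // with a one-character flag marking the role of the key.
    class FlightTagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(","); // assumed: origin,dest,flightno,...
            if (f.length < 2) return;                // skip malformed lines
            ctx.write(new Text(f[0]), new Text("S," + line)); // key is the source
            ctx.write(new Text(f[1]), new Text("D," + line)); // key is the destination
        }
    }

    // For a key X: records flagged D are legs arriving at X, records flagged S
    // are legs leaving X; joining them yields all one-stopover routes via X.
    class StopoverJoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text stopover, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> inbound = new ArrayList<>();
            List<String> outbound = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("D,")) inbound.add(s.substring(2));
                else                    outbound.add(s.substring(2));
            }
            for (String in : inbound)
                for (String out : outbound)
                    ctx.write(stopover, new Text(in + " | " + out)); // leg1 | leg2 via stopover
        }
    }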

That way only a single MR job with one Mapper and one Reducer is necessary for the task.
