
Iterating many times through Text input value in the Reducer in a MapReduce job

I have two very large datasets (tables) on HDFS. I want to join them on some column(s), then group them on some column(s), and then perform some group functions on some columns.

My steps are:

1- Create two jobs.

2- In the first job, the mapper reads the rows of each dataset as map input values and emits the join columns' values as the map output key and the remaining columns' values as the map output value.

After mapping, the MapReduce framework performs shuffling and groups all the map output values according to map output keys.

Then the reducer reads each map output key and its values, which may include many rows from both datasets.

What I want is to iterate through the reduce input values many times so that I can compute the cartesian product.

To illustrate:

Let's say that for a join key x, I have 100 matches from one dataset and 200 matches from the other. Joining them on join key x then produces 100*200 = 20000 combinations. I want to emit NullWritable as the reduce output key and each pair from the cartesian product as the reduce output value.

An example output might be:

for join key x:

From (nullWritable),(first(1),second(1))

Over (nullWritable),(first(1),second(200))

To (nullWritable),(first(100),second(200))

How can I do that?

I can iterate only once, and I cannot cache the values because they don't fit into memory.

3- Once that works, I will start the second job, which takes the first job's result file as its input file. In the mapper, I emit the group columns' values as the map output key and the remaining columns' values as the map output value. Then, in the reducer, I iterate through each key's values and apply functions such as sum, avg, max, and min to some columns.
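A minimal sketch of the per-group aggregation in step 3, written as plain Java so it runs without Hadoop. The class name `GroupStatsSketch` and the method `aggregate` are hypothetical, and the sketch assumes the column being aggregated has already been parsed to a number. In a real job the same single-pass loop would sit inside the reducer's `reduce` method, where one pass over the values is enough for all four functions.

```java
import java.util.Arrays;

public class GroupStatsSketch {

    // Computes {sum, avg, max, min} in a single pass over one group's values,
    // which matches the one-time iteration a reducer gets over its input.
    static double[] aggregate(Iterable<Double> values) {
        double sum = 0.0;
        long count = 0;
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        for (double v : values) {
            sum += v;
            count++;
            if (v > max) max = v;
            if (v < min) min = v;
        }
        // An empty group has no meaningful average
        double avg = (count == 0) ? Double.NaN : sum / count;
        return new double[] {sum, avg, max, min};
    }

    public static void main(String[] args) {
        double[] stats = aggregate(Arrays.asList(1.0, 2.0, 3.0, 6.0));
        System.out.println("sum=" + stats[0] + " avg=" + stats[1]
                + " max=" + stats[2] + " min=" + stats[3]);
    }
}
```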

Thanks a lot in advance.

Since your first MR job uses the join key as the map output key, your first reducer will get (K join_key, List<V> values) on each reduce invocation. What you can do is split the values into two separate lists, one per data source, and use a nested for loop to compute the cartesian product.
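The two-list approach could look roughly like the sketch below. It is plain Java rather than a real Hadoop `Reducer`, so it can be run on its own. The class name `JoinReducerSketch`, the method `crossJoin`, and the tag prefixes `A` and `B` are all assumptions: the mapper would have to prepend such a tag to each value so the reducer can tell which dataset a row came from. Note that both lists for a given join key are held in memory here. In an actual reducer, each value should also be copied (for example to a `String`) before it is stored, because Hadoop reuses the value object while iterating.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JoinReducerSketch {

    // Hypothetical tags the mapper is assumed to prepend to every value
    static final String FIRST_TAG = "A\t";
    static final String SECOND_TAG = "B\t";

    // Splits one key's values by source tag, then pairs every row of the
    // first dataset with every row of the second dataset.
    static List<String> crossJoin(Iterable<String> taggedValues) {
        List<String> first = new ArrayList<>();
        List<String> second = new ArrayList<>();

        // Single pass over the reduce input: route each row to its list
        for (String v : taggedValues) {
            if (v.startsWith(FIRST_TAG)) {
                first.add(v.substring(FIRST_TAG.length()));
            } else if (v.startsWith(SECOND_TAG)) {
                second.add(v.substring(SECOND_TAG.length()));
            }
        }

        // Nested loop over the two in-memory lists gives the cartesian product
        List<String> joined = new ArrayList<>();
        for (String a : first) {
            for (String b : second) {
                joined.add(a + "\t" + b);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "A\tfirst(1)", "B\tsecond(1)", "A\tfirst(2)", "B\tsecond(2)");
        // In a real reducer each row would be written with a NullWritable key
        for (String row : crossJoin(values)) {
            System.out.println(row);
        }
    }
}
```

With two rows from each source, as in `main`, this produces four joined rows. In a real job the inner loop would call `context.write` for each pair instead of collecting them into a list.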


 