简体   繁体   中英

Join of two datasets in Mapreduce/Hadoop

Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?

More specifically, here's what I exactly need to do:

I am having two sets of data:

  1. point information which is stored as (tile_number, point_id:point_info) , this is a 1:n key-value pairs. This means for every tile_number, there might be several point_id:point_info

  2. Line information which is stored as (tile_number, line_id:line_info) , this is again a 1:m key-value pairs and for every tile_number, there might be more than one line_id:line_info

As you can see the tile_numbers are the same between the two datasets. now what I really need is to join these two datasets based on each tile_number. In other words for every tile_number, we have n point_id:point_info and m line_id:line_info. What I want to do is to join all pairs of point_id:point_info with all pairs of line_id:line_info for every tile_number


In order to clarify, here's an example:

For point pairs:

(tile0, point0)
(tile0, point1)
(tile1, point1)
(tile1, point2)

for line pairs:

(tile0, line0)
(tile0, line1)
(tile1, line2)
(tile1, line3)

what I want is as following:

for tile 0:

 (tile0, point0:line0)
 (tile0, point0:line1)
 (tile0, point1:line0)
 (tile0, point1:line1)

for tile 1:

 (tile1, point1:line2)
 (tile1, point1:line3)
 (tile1, point2:line2)
 (tile1, point2:line3)

Use a mapper that outputs titles as keys and points/lines as values. You have to differentiate between the point output values and line output values. For instance you can use a special character (even though a binary approach would be much better).

So the map output will be something like:

 tile0, _point0
 tile1, _point0
 tile2, _point1 
 ...
 tileX, *lineL
 tileY, *lineK
 ...

Then, at the reducer, your input will have this structure:

 tileX, [*lineK, ... , _pointP, ...., *lineM, ..., _pointR]

and you will have to take the values separate the points and the lines, do a cross product and output each pair of the cross-product , like this:

tileX (lineK, pointP)
tileX (lineK, pointR)
...

If you can already easily differentiate between the point values and the line values (depending on your application specifications) you don't need the special characters (*,_)

Regarding the cross-product which you have to do in the reducer: You first iterate through the entire values List, separate them into 2 list:

 List<String> points;
 List<String> lines;

Then do the cross-product using 2 nested for loops. Then iterate through the resulting list and for each element output:

tile(current key), element_of_the_resulting_cross_product_list

So basically you have two options here.Reduce side join or Map Side Join .

Here your group key is "tile". In a single reducer you are going to get all the output from point pair and line pair. But you you will have to either cache point pair or line pair in the array. If either of the pairs(point or line) are very large that neither can fit in your temporary array memory for single group key(each unique tile) then this method will not work for you. Remember you don't have to hold both of key pairs for single group key("tile") in memory, one will be sufficient .

If both key pairs for single group key are large , then you will have to try map-side join.But it has some peculiar requirements. However you can fulfill those requirement by doing some pre-processing your data through some map/reduce jobs running equal number of reducers for both data.

I found this one as helpful

Joins with plain Map Reduce or MultipleInputs

http://kickstarthadoop.blogspot.in/2011/09/joins-with-plain-map-reduce.html

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM