
How can I join three tables?

I have a huge data set spread across three tables. Assume the tables contain data similar to the following:

Table A:

Id name place

1 aaa place1
2 bbb place2

Table B:

Id cId name

1 11 aaa
2 22 bbb

Table C:

cId cname

11 p1
22 p2

When I join Table A and Table B using Hadoop MapReduce, I get the following output:

kv

1 aaa place1 11
2 bbb place2 22

Now I want to join Table C with the above output so that 11 is replaced by p1. How can I solve this problem?

Probably the easiest solution is to use Pig, as @David mentioned. For a quick test, you could come up with something like this:

TABLE_A = LOAD 'hdfs://my_path/input/table_a.txt' using PigStorage(' ') AS (
            id:chararray, 
            name:chararray, 
            place:chararray
          );

TABLE_B = LOAD 'hdfs://my_path/input/table_b.txt' using PigStorage(' ') AS (
            id:chararray, 
            cid:chararray, 
            name:chararray
          );

TABLE_C = LOAD 'hdfs://my_path/input/table_c.txt' using PigStorage(' ') AS (
            cid:chararray, 
            cname:chararray
          );

-- Join A and B on id; keep A's columns plus B's cid.
TMP = FOREACH (join TABLE_A by id, TABLE_B by id) GENERATE 
        TABLE_A::id as id, 
        TABLE_A::name as name, 
        TABLE_A::place as place, 
        TABLE_B::cid as cid;


-- Join the intermediate result with C on cid; cname takes the place of cid.
JOIN_ABC = FOREACH (join TMP by cid, TABLE_C by cid) GENERATE 
             TMP::id, 
             TMP::name, 
             TMP::place, 
             TABLE_C::cname;

store JOIN_ABC into 'hdfs://my_path/output' using PigStorage(' ');

The common algorithm for joining two datasets with MapReduce is:

  • map each dataset so that the field you want to join on becomes the key; it is also useful to tag each record so that, later in the reduce stage, you can tell which dataset it came from
  • concatenate the mapped datasets into one
  • reduce the combined dataset; since the key is the join field, all records sharing a key arrive grouped together, and you only need to perform the join within each group

So once you understand how to join two datasets, you can repeat the operation to join the result with a third.
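The steps above can be sketched in plain Python, without Hadoop, to make the data flow visible. This is only an in-memory illustration under assumptions: the function names (`map_tagged`, `reduce_side_join`), the `"L"`/`"R"` tags, and the tuple layout are all made up for the example, and a dictionary stands in for the shuffle that a real MapReduce job would perform.

```python
from collections import defaultdict


def map_tagged(records, tag, key_index):
    """Map step: emit (join_key, (tag, remaining_fields)) for each record."""
    for rec in records:
        rest = rec[:key_index] + rec[key_index + 1:]
        yield rec[key_index], (tag, rest)


def reduce_side_join(left, left_key, right, right_key):
    """Join two lists of tuples on the given column indexes."""
    # Concatenate both tagged datasets and group them by key.
    # In a real job this grouping is done by the shuffle phase.
    groups = defaultdict(list)
    for key, value in map_tagged(left, "L", left_key):
        groups[key].append(value)
    for key, value in map_tagged(right, "R", right_key):
        groups[key].append(value)

    # Reduce step: within each key, pair every left record with every right record.
    joined = []
    for key in sorted(groups):
        lefts = [rest for tag, rest in groups[key] if tag == "L"]
        rights = [rest for tag, rest in groups[key] if tag == "R"]
        for l in lefts:
            for r in rights:
                joined.append((key,) + l + r)
    return joined


table_a = [("1", "aaa", "place1"), ("2", "bbb", "place2")]
table_b = [("1", "11", "aaa"), ("2", "22", "bbb")]
table_c = [("11", "p1"), ("22", "p2")]

# First "job": A join B on Id -> (id, name, place, cId, name_from_b)
ab = reduce_side_join(table_a, 0, table_b, 0)

# Second "job": (A join B) join C on cId, which sits at index 3 of the first result
abc = reduce_side_join(ab, 3, table_c, 0)

# Keep id, name, place and cname; drop cId and the duplicate name from B.
result = [(id_, name, place, cname)
          for _cid, id_, name, place, _b_name, cname in abc]

for row in result:
    print(" ".join(row))
```

The second call reuses the same function on the output of the first, which mirrors running a second MapReduce job over the first job's output.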

A disadvantage of this approach: if one of your datasets is a small dictionary, the number of reducers that can do useful work in the reduce stage is limited by the size of that dictionary (more precisely, by the number of distinct join keys, which cannot exceed the size of the dictionary).
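To see why the key space bounds the parallelism, here is a small sketch. It assumes records are routed to reducers by hashing the join key modulo the number of reducers, which is the general idea behind hash partitioning; the `partition` function below uses `zlib.crc32` only as a deterministic stand-in, and the function names and numbers are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer for a key; a deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def busy_reducers(keys, num_reducers):
    """Return the set of reducer indexes that receive at least one record."""
    return {partition(k, num_reducers) for k in keys}


# Many records, but only two distinct join keys (as in Table C above).
join_keys = ["11", "22"] * 1000
used = busy_reducers(join_keys, num_reducers=8)

print(f"{len(used)} of 8 reducers receive data")
```

However many reducers are configured, at most as many of them as there are distinct keys can receive records; the rest stay idle.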

I do not think you can join three tables in a single MR step. So I think you simply need another MR job that takes the result of joining A and B and joins it with C.
And a bit off topic: I would suggest trying Hive or Pig for this before coding MR in Java.

The technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please cite the site URL or the original address.
