
Data set join using EMR

I have two tab-delimited datasets stored in AWS S3. I am trying to write an EMR job that joins them on a common key (a set of field values). My current version populates two lists and compares them line by line, outputting the rows that share a key. I have been writing in Python, but I cannot figure out the logic for bringing two files in through stdin and comparing their rows against each other to perform the join. Most of the documentation I find is in Java. I am using Amazon's EMR to run all my jobs. Any help is greatly appreciated.

Thank you.

As you are using EMR already, have you looked at Hive?

http://aws.amazon.com/articles/Elastic-MapReduce/3681655242374956
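
If you would rather stay with Python and Hadoop streaming, one common approach is a reduce-side join: the mapper prefixes every row with its join key and a tag saying which dataset it came from, Hadoop sorts by key, and the reducer pairs up rows that share a key. Below is a minimal sketch of that idea, not a tested EMR job. The tags `A` and `B`, the path fragment `dataset_a`, the `key_cols` parameter, and the `|` key separator are all hypothetical. It also assumes streaming exposes the current input path in the `mapreduce_map_input_file` (or older `map_input_file`) environment variable, which you should verify on your Hadoop version.

```python
import os
import sys
from itertools import groupby


def map_lines(lines, tag, key_cols=(0,)):
    """Emit 'key<TAB>tag<TAB>original row' for each tab-delimited row."""
    for line in lines:
        row = line.rstrip("\n")
        if not row:
            continue  # skip blank lines
        fields = row.split("\t")
        # Build the join key from the chosen columns (hypothetical '|' separator).
        key = "|".join(fields[i] for i in key_cols)
        yield f"{key}\t{tag}\t{row}"


def reduce_lines(lines):
    """Inner-join rows that share a key; rows with equal keys must be adjacent."""
    parsed = (line.rstrip("\n").split("\t", 2) for line in lines)
    for _key, group in groupby(parsed, key=lambda parts: parts[0]):
        left, right = [], []
        for _, tag, row in group:
            (left if tag == "A" else right).append(row)
        # Emit every pairing of a dataset-A row with a dataset-B row.
        for left_row in left:
            for right_row in right:
                yield f"{left_row}\t{right_row}"


if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else ""
    if mode == "map":
        # Assumption: the input file path is available in one of these variables.
        path = os.environ.get("mapreduce_map_input_file",
                              os.environ.get("map_input_file", ""))
        tag = "A" if "dataset_a" in path else "B"  # hypothetical path fragment
        for out in map_lines(sys.stdin, tag):
            print(out)
    elif mode == "reduce":
        for out in reduce_lines(sys.stdin):
            print(out)
```

You would pass the same script as both mapper (`join.py map`) and reducer (`join.py reduce`). Note that the reducer buffers all rows for one key in memory, so a heavily skewed key could be a problem; Hive handles that kind of detail for you, which is part of why it is worth a look.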
