
Query related to Hadoop's map-reduce

Scenario:

I have a subset of a database and a data warehouse, and I have brought both onto HDFS. I want to analyse results based on the subset and the data warehouse. (In short, for each record in the subset I have to scan every record in the data warehouse.)

Question:

I want to do this with MapReduce. I don't understand how to pass both files as input to the mapper, or how to handle both files in the map phase.

Please suggest some ideas for how I can do this.

Some time ago I wrote a Hadoop MapReduce job for one of my classes. I was scanning several IMDb databases and producing merged information about actors (basically, the name, biography and films each actor appeared in were in different databases). I think you can use the same approach I used for my homework: I wrote a separate MapReduce job that converted every database file into the same format, placing a two-letter prefix in front of every row it produced so I could tell the sources apart: 'BI' (biography), 'MV' (movies) and so on. Then I used all of these output files as input to a final MapReduce job that grouped them in the desired way.
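As a rough illustration of the tag-then-group idea described above, here is a minimal sketch in plain Java that imitates the map, shuffle and reduce steps in memory. It is not Hadoop API code and not the original homework: the class name TaggedJoinSketch, the key<TAB>payload row layout and the output format are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of a reduce-side join with tagged rows (hypothetical names throughout).
class TaggedJoinSketch {

    // "Map" step: split each row into key and payload, then emit the payload
    // with a two-letter source tag so the reducer can tell the inputs apart.
    static void mapTagged(String tag, List<String> rows, Map<String, List<String>> shuffle) {
        for (String row : rows) {
            String[] parts = row.split("\t", 2); // assumed layout: key<TAB>payload
            if (parts.length < 2) {
                continue; // skip rows that do not match the assumed layout
            }
            shuffle.computeIfAbsent(parts[0], k -> new ArrayList<>())
                   .add(tag + ":" + parts[1]);
        }
    }

    // "Reduce" step: all tagged values for one key arrive together;
    // separate them by tag and merge them into a single output row.
    static String reduce(String key, List<String> taggedValues) {
        String bio = "";
        List<String> movies = new ArrayList<>();
        for (String value : taggedValues) {
            if (value.startsWith("BI:")) {
                bio = value.substring(3);
            } else if (value.startsWith("MV:")) {
                movies.add(value.substring(3));
            }
        }
        return key + "\t" + bio + "\t" + String.join(",", movies);
    }

    // Runs both "map" passes into one grouped structure (standing in for the
    // shuffle), then reduces each key in sorted order.
    static List<String> join(List<String> bios, List<String> movies) {
        Map<String, List<String>> shuffle = new TreeMap<>();
        mapTagged("BI", bios, shuffle);
        mapTagged("MV", movies, shuffle);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : shuffle.entrySet()) {
            out.add(reduce(entry.getKey(), entry.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> bios = Arrays.asList("Actor A\tborn 1950", "Actor B\tborn 1960");
        List<String> movies = Arrays.asList("Actor A\tFilm X", "Actor A\tFilm Y", "Actor B\tFilm Z");
        join(bios, movies).forEach(System.out::println);
    }
}
```

In a real job the grouping by key is done by the framework's shuffle, so only the tagging in the mappers and the per-key merge in the reducer would need to be written.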

I am not even sure you need that much work if you really are going to scan every line of the data warehouse. In that case you may be able to do the scan directly in either the map or the reduce phase (depending on what additional processing you want to do). My suggestion assumes that you actually need to filter the data warehouse based on the subset; if so, it might work for you.
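If the subset is small enough to fit in memory, the filtering case could look roughly like the sketch below: load the subset once, then stream over the data warehouse rows and keep the ones that match. This is plain Java standing in for what a mapper would do, not Hadoop API code. The class name MemoryBackedJoinSketch, the key<TAB>payload row layout, and the matching rule (equality on the first field) are assumptions, since the question does not say how records are compared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// In-memory imitation of a map-side filter against a small lookup set (hypothetical names).
class MemoryBackedJoinSketch {

    // Assumed row layout: key<TAB>payload. A row without a tab is treated as all key.
    static String keyOf(String row) {
        return row.split("\t", 2)[0];
    }

    // Done once per mapper in a real job (for example in a setup step):
    // read the small subset and remember its keys.
    static Set<String> loadSubsetKeys(List<String> subsetRows) {
        Set<String> keys = new HashSet<>();
        for (String row : subsetRows) {
            keys.add(keyOf(row));
        }
        return keys;
    }

    // Done once per warehouse record: a constant-time lookup replaces
    // rescanning the whole subset for every record.
    static List<String> scanWarehouse(Set<String> subsetKeys, Iterable<String> warehouseRows) {
        List<String> matches = new ArrayList<>();
        for (String row : warehouseRows) {
            if (subsetKeys.contains(keyOf(row))) {
                matches.add(row);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> keys = loadSubsetKeys(Arrays.asList("k1\tsubset-a", "k3\tsubset-b"));
        List<String> warehouse = Arrays.asList("k1\tw1", "k2\tw2", "k3\tw3");
        scanWarehouse(keys, warehouse).forEach(System.out::println);
    }
}
```

If the comparison is more complex than key equality, the set lookup would be replaced by a loop over the in-memory subset, which still needs only one pass over the data warehouse.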

Check Section 3.5 (Relational Joins) of Data-Intensive Text Processing with MapReduce for map-side joins, reduce-side joins and memory-backed joins. In any case, the MultipleInputs class can be used to have different mappers process different files in a single job.

FYI, you could use Apache Sqoop to import the database into HDFS.
