
Hadoop Streaming Python Multiple Input Files Single Mapper

I have a single mapper.

import sys

for line in sys.stdin:
    # if line is from file1:
    #     process it based on some_arbitrary_logic
    #     emit k, v

    # if line is from file2:
    #     process it based on another_arbitrary_logic
    #     emit k, v

And I need to call this mapper through the Hadoop Streaming API, with -input file1 and another -input file2.

How do I achieve this? How do I know which line belongs to which file on the stdin that Hadoop Streaming gives me?

UPDATE

File1

Fruit, Vendor, Cost

Oranges, FreshOrangeCompany, 50
Apples, FreshAppleCompany, 100

File2

Vendor, Location, NumberOfOffices

FreshAppleCompany, NewZealand, 45
FreshOrangeCompany, FijiIslands, 100

What I need to do is print out how many offices sell oranges:

Oranges 100

So both files need to be INPUT to the mapper.

Inside the mapper's Python code, os.environ["map_input_file"] should give you the file name of the split the mapper is currently processing. (Hadoop Streaming exposes the map.input.file job property as an environment variable with the dots replaced by underscores.)
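For example, a minimal sketch of that check (assuming the input paths contain the literal strings file1 and file2; the exact property name depends on the Hadoop version):

    import os
    import sys

    # Hadoop Streaming exports job configuration properties as environment
    # variables with non-alphanumeric characters replaced by underscores,
    # so map.input.file becomes map_input_file (mapreduce_map_input_file
    # on newer Hadoop versions).
    input_file = os.environ.get("map_input_file",
                                os.environ.get("mapreduce_map_input_file", ""))

    for line in sys.stdin:
        if "file1" in input_file:
            pass  # process with some_arbitrary_logic, emit k, v
        else:
            pass  # process with another_arbitrary_logic, emit k, v

Since each map task reads a split from exactly one file, the check could equally be done once, outside the loop.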

The question is slightly ambiguous, in the sense that not much detail was provided about the purpose of the files, so I'm making some assumptions:

  • If file1 and file2 are just two data files containing the same kind of data, and all you need to ensure is that both files are processed, then just copy the files into an HDFS folder, point -input at that folder, and you are good. Data from both files will be fed to the mappers.

  • If file1 and file2 have different purposes, for example file1 is the input file for the mapper but file2 is something you need to refer to for joins, then use the distributed cache. Check this: Hadoop Streaming with multiple input

  • If file1 and file2 are both input files, are related, and you need to do a join: if file1 or file2 is small, you may be able to ship it through the distributed cache as a regular or archive file. However, if both files are big, it is slightly more complicated, as you may have to run multiple MR jobs, or transform the files into a format that Hive can use, do the join in Hive, and then use the join result as input to your Python mapper job. A reduce-side join sketch for the example above follows this list.
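For the fruit/offices example, here is a minimal reduce-side join sketch (the HDFS paths, script names, and streaming-jar location are assumptions; adjust them to your installation). The mapper keys every record by vendor and tags it with its source file:

    #!/usr/bin/env python
    # mapper.py: key each record by vendor, tag it with its source
    import os
    import sys

    input_file = os.environ.get("map_input_file",
                                os.environ.get("mapreduce_map_input_file", ""))

    for line in sys.stdin:
        fields = [f.strip() for f in line.strip().split(",")]
        if len(fields) != 3 or fields[0] in ("Fruit", "Vendor"):
            continue  # skip blank lines and the header rows
        if "file1" in input_file:          # Fruit, Vendor, Cost
            fruit, vendor, cost = fields
            print("%s\tF\t%s" % (vendor, fruit))
        else:                              # Vendor, Location, NumberOfOffices
            vendor, location, offices = fields
            print("%s\tO\t%s" % (vendor, offices))

The reducer joins the two tagged streams per vendor; the shuffle guarantees that all records for one vendor arrive together, in either tag order:

    #!/usr/bin/env python
    # reducer.py: for each vendor, pair its fruits with its office count
    import sys

    current_vendor, fruits, offices = None, [], None

    def flush():
        # emit one line per fruit once the vendor's office count is known
        if offices is not None:
            for fruit in fruits:
                print("%s\t%s" % (fruit, offices))

    for line in sys.stdin:
        vendor, tag, value = line.strip().split("\t")
        if vendor != current_vendor:
            flush()
            current_vendor, fruits, offices = vendor, [], None
        if tag == "F":
            fruits.append(value)
        else:
            offices = value
    flush()

And the streaming invocation, with both files passed as -input:

    hadoop jar /path/to/hadoop-streaming.jar \
        -input /user/me/file1 -input /user/me/file2 \
        -output /user/me/fruit_offices \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py

On the sample data this emits Oranges paired with 100, since FreshOrangeCompany sells Oranges and has 100 offices.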

Hope this helps.
