
Hadoop Streaming Python Multiple Input Files Single Mapper

I have a single mapper.

import sys

for line in sys.stdin:
    line = line.strip()
    # if line is from file1:
    #     process it based on some_arbitrary_logic
    #     emit k, v

    # if line is from file2:
    #     process it based on another_arbitrary_logic
    #     emit k, v

And I need to call this mapper through the Hadoop Streaming API with -input file1 and another -input file2.

How do I achieve this? How do I know which line in the STDIN that Hadoop Streaming gives me belongs to which file?

UPDATE

File1

Fruit, Vendor, Cost

Oranges, FreshOrangesCompany, 50
Apples, FreshAppleCompany, 100

File2

Vendor, Location, NumberOfOffices

FreshAppleCompany, NewZealand, 45
FreshOrangeCompany, FijiIslands, 100

What I need to do is print out in how many offices oranges are sold.

Oranges 100

So both files need to be INPUT to the mapper.

os.environ["map.input.file"] inside the mapper Python code should give you the file name of the split the mapper is currently processing. (Note that Hadoop Streaming exports job configuration properties to the environment with dots replaced by underscores, so in practice you read map_input_file, or mapreduce_map_input_file on newer Hadoop versions.)
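
Building on that, a minimal mapper sketch is shown below. The choice of environment variable name, the assumption that the strings "file1"/"file2" appear in the HDFS paths of the two inputs, and the tab-separated, vendor-keyed output format are all assumptions made for illustration, not something fixed by Hadoop Streaming itself.

import os
import sys

# Sketch: branch on the file the current input split came from.
# Assumes "file1"/"file2" appear in the HDFS paths of the two inputs;
# the env var name depends on the Hadoop version.
input_file = os.environ.get("mapreduce_map_input_file") or \
             os.environ.get("map_input_file", "")

for line in sys.stdin:
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) != 3 or fields[0] in ("Fruit", "Vendor"):
        continue  # skip blank and header lines
    if "file1" in input_file:
        fruit, vendor, cost = fields        # file1: Fruit, Vendor, Cost
        print("%s\tFRUIT\t%s" % (vendor, fruit))
    elif "file2" in input_file:
        vendor, location, offices = fields  # file2: Vendor, Location, NumberOfOffices
        print("%s\tOFFICES\t%s" % (vendor, offices))

The job would then be launched with both inputs, e.g. -input file1 -input file2 plus -mapper pointing at this script, and a reducer would group on the vendor key to combine the FRUIT and OFFICES records.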

The question is slightly ambiguous, in the sense that not much detail was provided about the purpose of the files. So I'm making some assumptions:

  • If file1 and file2 are just two data files with the same kind of data, and all you need is to ensure both are processed, then just copy the files into an HDFS folder, make sure that folder is specified as the input folder, and you are good. Data from both files will be used to call the mappers.

  • If file1 and file2 have different purposes, for example file1 is the input file for the mapper while file2 is something you need to refer to for a join, then use the distributed cache. Check this: Hadoop Streaming with multiple input.

  • If file1 and file2 are both input files, are related, and you need to do a join: if file1 or file2 is small, you may be able to put it in the distributed cache as a regular file or an archive file (a small sketch of this follows the list). However, if both files are big, it gets more complicated, as you may have to chain multiple MR jobs, or transform the files into a format Hive can use, do the join in Hive, and then use the join result as input to your Python mapper job.
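
To make the distributed-cache option concrete, below is a minimal map-side join sketch for the Oranges example. It assumes file2 is small enough to ship alongside the job (for example with -file file2, so it appears in the task's working directory under that name), that vendor names match exactly across the two files, and that fruit<TAB>offices straight out of the mapper is acceptable; treat it as a starting point rather than a full solution.

import sys

# Build a vendor -> offices lookup from the cached copy of file2.
offices_by_vendor = {}
with open("file2") as f:
    for cached_line in f:
        fields = [x.strip() for x in cached_line.strip().split(",")]
        if len(fields) == 3 and fields[0] != "Vendor":
            vendor, location, offices = fields
            offices_by_vendor[vendor] = offices

# file1 arrives on stdin as the streaming -input; join on vendor.
for line in sys.stdin:
    fields = [x.strip() for x in line.strip().split(",")]
    if len(fields) == 3 and fields[0] != "Fruit":
        fruit, vendor, cost = fields
        print("%s\t%s" % (fruit, offices_by_vendor.get(vendor, "0")))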

Hope this helps.
