
Python MapReduce Hadoop Streaming Job that requires multiple input files?

I have two files in my cluster, File A and File B, with the following data:

File A

#Format: 
#Food Item | Is_A_Fruit (BOOL)

Orange | Yes
Pineapple | Yes
Cucumber | No
Carrot | No
Mango | Yes

File B

#Format:
#Food Item | Vendor Name

Orange | Vendor A
Pineapple | Vendor B
Cucumber | Vendor B
Carrot | Vendor B
Mango | Vendor A

Basically I want to find out: how many fruits is each vendor selling?

Expected output:

Vendor A | 2
Vendor B | 1

I need to do this using Hadoop Streaming with Python MapReduce.

I have read how to do a basic word count: I read from sys.stdin and emit k,v pairs for the reducer to then reduce.

How do I approach this problem?

My main concern is how to read from multiple files and then compare them in Hadoop Streaming.

I can do this in normal Python (i.e. without MapReduce and Hadoop, it's straightforward), but that is infeasible for the sheer size of the data I have with me.
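For context, a minimal sketch of that plain-Python version (assuming the pipe-delimited files shown above are available locally as FileA and FileB; the names and parsing are illustrative only) might look like this:

import collections

# Load File A: food item -> is it a fruit?
with open('FileA') as fa:
    rows = [line.split('|') for line in fa if '|' in line and not line.startswith('#')]
    is_fruit = {item.strip(): flag.strip() == 'Yes' for item, flag in rows}

# Scan File B and count the fruit rows per vendor
counts = collections.Counter()
with open('FileB') as fb:
    for line in fb:
        if '|' not in line or line.startswith('#'):
            continue
        item, vendor = [p.strip() for p in line.split('|')]
        if is_fruit.get(item):
            counts[vendor] += 1

for vendor, n in sorted(counts.items()):
    print('%s | %d' % (vendor, n))

This only works while both files fit comfortably on a single machine, which is exactly the limitation described above.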

Is File A really that large? I would put it in the DistributedCache and read it from there. To put it in the distributed cache, use this option in the Hadoop streaming call:

-cacheFile 'hdfs://namenode:port/the/hdfs/path/to/FileA#FileA'

(I suppose the following should work too, but I have not tried it:)

-cacheFile '/the/hdfs/path/to/FileA#FileA'

Note that the #FileA suffix is the name you are using to make the file available to your mappers.

Then, in your mapper, you'll read FileB from sys.stdin (assuming you called Hadoop Streaming using -input '/user/foo/FileB') and, to read FileA, you should do something like this:

f = open('FileA', 'r')
...
f.readline()

Now, I suppose you have already thought of this, but to me it would make sense to have a mapper like this:

  1. Open FileA.
  2. Read FileA, line by line (in a loop), and load it into a map so that you can easily look up a key and find its value (Yes, No).
  3. Have your main loop read from stdin. Within the loop, for each line (in FileB), check your map (see step 2) to find out whether you have a fruit or not... etc. A minimal sketch of such a mapper follows below.
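Putting those three steps together, an untested sketch of that mapper (plus a matching word-count style reducer) could look like this, assuming the pipe-delimited format from the question and that FileA is available under the name FileA via -cacheFile:

#!/usr/bin/env python
# mapper.py -- load FileA from the distributed cache, stream FileB from stdin
import sys

# Steps 1-2: build a lookup table: food item -> True if it is a fruit
is_fruit = {}
with open('FileA') as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        item, flag = [p.strip() for p in line.split('|')]
        is_fruit[item] = (flag == 'Yes')

# Step 3: for every FileB line on stdin, emit "vendor <TAB> 1" if it is a fruit
for line in sys.stdin:
    line = line.strip()
    if not line or line.startswith('#'):
        continue
    item, vendor = [p.strip() for p in line.split('|')]
    if is_fruit.get(item, False):
        print('%s\t1' % vendor)

#!/usr/bin/env python
# reducer.py -- sum the 1s per vendor; Hadoop sorts mapper output by key,
# so all lines for a given vendor arrive together
import sys

current, count = None, 0
for line in sys.stdin:
    vendor, value = line.rstrip('\n').split('\t')
    if vendor != current:
        if current is not None:
            print('%s | %d' % (current, count))
        current, count = vendor, 0
    count += int(value)
if current is not None:
    print('%s | %d' % (current, count))

The reducer has the same shape as the basic word count you have already read about.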

Please take a look at this example, as it is very directly related to what you are looking for.

One approach is to do this as two jobs.

  1. Filter FileB so that only rows containing fruits are retained (a sketch of this job follows after this list).
    • Map1: Emit a composite key of "Food Item" plus which file the data came from. Partition on "Food Item", with a secondary sort on whether the row contains the "Is_A_Fruit" information (to ensure that this row is read first by the reducer for each food item).
    • Reduce1: With the secondary sort, the first row in the sorted data for each food item indicates whether it is a fruit (in which case the reducer will output its vendor rows) or not (in which case it won't).
  2. Use the vendor as the key to count the number of fruits per vendor.
    • The MapReduce output from the first job now has the same structure as FileB, but all rows are fruits, so this is more like word count: use Vendor Name as the key and count the number of rows.
    • If you want unique fruits, you may need to use a secondary sort again to eliminate the need to load all fruits associated with each vendor into memory.
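As a rough, untested sketch of job 1 with Hadoop Streaming: the mapper runs over both files (e.g. -input pointing at the directory containing FileA and FileB), tells the two row types apart from the structure of the data (FileA rows carry Yes/No, FileB rows carry a vendor name), and emits a composite key so that the Is_A_Fruit row sorts first for each food item. The exact secondary-sort options vary between Hadoop versions, so treat the flags mentioned below as assumptions to verify against your distribution.

#!/usr/bin/env python
# join_mapper.py -- job 1 mapper: composite key "food item <TAB> flag";
# flag 0 = Is_A_Fruit row (from FileA), flag 1 = vendor row (from FileB),
# so the Is_A_Fruit row sorts before the vendor rows for the same item
import sys

for line in sys.stdin:
    line = line.strip()
    if not line or line.startswith('#'):
        continue
    item, value = [p.strip() for p in line.split('|')]
    flag = '0' if value in ('Yes', 'No') else '1'
    print('%s\t%s\t%s' % (item, flag, value))

#!/usr/bin/env python
# join_reducer.py -- job 1 reducer: output "fruit | vendor" rows only.
# Relies on partitioning by food item and on the secondary sort delivering
# the Is_A_Fruit row first for each item.
import sys

current_item, current_is_fruit = None, False
for line in sys.stdin:
    item, flag, value = line.rstrip('\n').split('\t')
    if item != current_item:
        current_item, current_is_fruit = item, False
    if flag == '0':
        current_is_fruit = (value == 'Yes')
    elif current_is_fruit:
        print('%s | %s' % (item, value))

For the partitioning to work, the streaming call would need something along the lines of -D stream.num.map.output.key.fields=2 together with the KeyFieldBasedPartitioner configured to partition on the first field only; see the secondary-sort tutorial mentioned below for the exact options. Job 2 is then an ordinary word count over this output with Vendor Name as the key.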

That said: the solution @cabad suggests is best if the file is small enough.

If not, the approach with secondary sorts is best. Take a look at the tutorial suggested in @Simplefish's answer here for how to do secondary sorts within a partition (those are the keywords that will point you in the right direction to do what you want: guarantee the order of the data associated with a given key that is passed to a reducer).

One final note: your problem isn't "How to read from multiple files", as any solution you devise can't rely on knowing which file the input comes from (you'll need to rely on the structure of the data, although that isn't an issue in this example).

You only provide the directory which contains the files to Hadoop. The Hadoop framework will read them; you don't.

Hadoop will apply the map class you code to the contents of all the files.

Then, Hadoop will apply the reduce class you coded to all of the output from the map class.
