Python MapReduce Hadoop Streaming Job that requires multiple input files?
I have two files in my cluster, File A and File B, with the following data -
File A
#Format:
#Food Item | Is_A_Fruit (BOOL)
Orange | Yes
Pineapple | Yes
Cucumber | No
Carrot | No
Mango | Yes
File B
#Format:
#Food Item | Vendor Name
Orange | Vendor A
Pineapple | Vendor B
Cucumber | Vendor B
Carrot | Vendor B
Mango | Vendor A
Basically I want to find out how many fruits each vendor is selling.
Expected output:
Vendor A | 2
Vendor B | 1
I need to do this using Hadoop Streaming with Python MapReduce.
I have read how to do a basic word count: read from sys.stdin
and emit k,v
pairs for the reducer to then reduce.
How do I approach this problem?
My main concern is how to read from multiple files and then compare them in Hadoop Streaming.
I can do this in plain Python (i.e. without MapReduce and Hadoop it's straightforward), but it's infeasible for the sheer size of the data that I have with me.
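For reference, the plain-Python logic (which only works while everything fits in memory) looks roughly like this — function names are illustrative, not from any library:

```python
def parse(lines):
    """Yield (left, right) pairs from 'a | b' lines, skipping comments and blanks."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith('#'):
            left, right = [p.strip() for p in line.split('|')]
            yield left, right

def fruits_per_vendor(file_a_lines, file_b_lines):
    """Join File B against File A and count fruits sold per vendor."""
    is_fruit = {food: flag == 'Yes' for food, flag in parse(file_a_lines)}
    counts = {}
    for food, vendor in parse(file_b_lines):
        if is_fruit.get(food):
            counts[vendor] = counts.get(vendor, 0) + 1
    return counts

# On the sample data above, fruits_per_vendor(...) returns
# {'Vendor A': 2, 'Vendor B': 1}.
```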
Is File A really that large? I would put it in the DistributedCache and read it from there.
To put it in the distributed cache, use this option in the Hadoop Streaming call:
-cacheFile 'hdfs://namenode:port/the/hdfs/path/to/FileA#FileA'
(I suppose the following should work too, but I have not tried it:)
-cacheFile '/the/hdfs/path/to/FileA#FileA'
Note that the #FileA
part is the name you use to make the file available to your mappers.
Then, in your mapper, you'll read FileB from sys.stdin
(assuming you called Hadoop Streaming using -input '/user/foo/FileB'
) AND, to read FileA, you should do something like this:
f = open('FileA', 'r')
...
f.readline()
Now, I suppose you have already thought of this, but to me it would make sense to have a mapper like this:
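A minimal sketch of such a mapper (names are illustrative; it assumes FileA is available through the -cacheFile symlink above and FileB records arrive on stdin):

```python
import sys

def load_lookup(path):
    """Read 'Food | Yes/No' lines into a dict; skip comments and blanks."""
    is_fruit = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            food, flag = [p.strip() for p in line.split('|')]
            is_fruit[food] = (flag == 'Yes')
    return is_fruit

def run_mapper(is_fruit, lines, out):
    """For each 'Food | Vendor' record, emit 'vendor<TAB>1' if the food is a fruit."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        food, vendor = [p.strip() for p in line.split('|')]
        if is_fruit.get(food):
            out.write('%s\t1\n' % vendor)

# In the actual streaming job, the script would end with:
#   run_mapper(load_lookup('FileA'), sys.stdin, sys.stdout)
```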
Please take a look at this example, as it is very directly related to what you are looking for.
One approach is to do this as two jobs.
That said: the solution @cabad suggests is best if the file is small enough.
If not, the approach with secondary sorts is best. Take a look at the tutorial suggested in @Simplefish's answer here for how to do secondary sorts within a partition (those are the keywords which will point you in the right direction to do what you want: make guarantees about the order of the data associated with a given key that is passed to a reducer).
One final note: your problem isn't "how to read from multiple files", since any solution you devise can't rely on knowing which file the input comes from (you'll need to rely on the structure of the data, although that isn't an issue in this example).
You only provide the directory containing the files to Hadoop. The Hadoop framework will read them; you don't.
Hadoop will apply the map class you code to all of the files' contents.
Then, Hadoop will apply the reduce class coded by you to all the output from the map class.
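For example, if the mappers emit tab-separated "vendor&lt;TAB&gt;1" records, a minimal streaming reducer that sums them could look like this (a sketch with illustrative names; it relies on Hadoop sorting the map output by key before the reduce phase):

```python
import sys

def run_reducer(lines, out):
    """Sum 'key<TAB>count' records; assumes the input is sorted by key,
    which the Hadoop shuffle/sort phase guarantees."""
    current, total = None, 0
    for line in lines:
        key, _, count = line.rstrip('\n').partition('\t')
        if key != current:
            if current is not None:
                out.write('%s | %d\n' % (current, total))
            current, total = key, 0
        total += int(count)
    if current is not None:
        out.write('%s | %d\n' % (current, total))

# In the actual streaming job, the script would end with:
#   run_reducer(sys.stdin, sys.stdout)
```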