How to sort with multiple fields in MapReduce Python Streaming?
I'm having a problem with sorting while using MapReduce with Hadoop Streaming and Python.
This is part of a bigger problem, but it can be reduced (no pun intended :) ) to this:
>> cat inputFile.txt
a b 1 file1
a b 2 file1
e f 0 file2
d c 3 file3
d e 2 file4
a c 5 file5
a b 3 file1
d c 2 file3
e f 2 file2
a c 4 file5
d e 10 file4
The first and second columns are the keys.
I'd like the output of the map phase to be sorted this way (first by column 1, then by column 2, and then by column 3 numerically):
>>sort -k1,1 -k2,2 -k3n,3 inputFile.txt
a b 1 file1
a b 2 file1
a b 3 file1
a c 4 file5
a c 5 file5
d c 2 file3
d c 3 file3
d e 2 file4
d e 10 file4
e f 0 file2
e f 2 file2
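For reference, the same ordering can be reproduced locally in Python, which makes the intended comparison explicit: columns 1 and 2 compare as text and column 3 as an integer. This is only a sketch for checking small samples. The helper name `sort_key` is made up here, and it assumes whitespace-separated fields as in the sample above.

```python
def sort_key(line):
    """Build the comparison key: col1 and col2 as text, col3 as an integer."""
    col1, col2, col3 = line.split()[:3]
    return (col1, col2, int(col3))


# Same records as inputFile.txt above.
lines = [
    "a b 1 file1", "a b 2 file1", "e f 0 file2", "d c 3 file3",
    "d e 2 file4", "a c 5 file5", "a b 3 file1", "d c 2 file3",
    "e f 2 file2", "a c 4 file5", "d e 10 file4",
]

sorted_lines = sorted(lines, key=sort_key)
for line in sorted_lines:
    print(line)
```

Converting column 3 with `int` is what keeps `d e 2` ahead of `d e 10`; a plain string comparison would put `10` first.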
The fourth column here is a hint at how I'd like the records to be split into files for the reduce step, but it's OK if two keys end up in the same file (as long as all instances of each key are in a single file). To achieve this, I run the following command:
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -D stream.num.map.output.key.fields=2 -D mapred.text.key.comparator.options="-k3,3" -D mapred.text.key.partitioner.options="-k3,3" -mapper cat -reducer cat -input /user/hadoop/inputFile.txt -output /user/hadoop/output
The output of this command is not sorted. For example:
>>cat output/part-00066
a b 2 file1
a b 3 file1
a b 1 file1
Remarks:
It feels like I'm missing something really basic. What am I doing wrong here?
Thanks a lot for your help!
After trying almost every possible combination, I've found that this works:
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D stream.num.map.output.key.fields=4 \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.text.key.comparator.options="-k1,1 -k2,2 -k3n,3" \
-input /user/hadoop/inputFile.txt \
-output /user/hadoop/output \
-mapper cat -reducer cat \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
Further explanation can be found here:
The key (again, no pun intended :) ) is the use of KeyFieldBasedPartitioner as the partitioner.
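To see what the partitioner and comparator options are asking for, here is a small local simulation in plain Python. It is not Hadoop's implementation: the function names, the `crc32` stand-in for the partition hash, and the reducer count of 3 are all assumptions made for illustration. It partitions on fields 1 and 2 only (the role of `-k1,2`) and then sorts each partition by field 1, field 2, and field 3 numerically (the role of `-k1,1 -k2,2 -k3n,3`).

```python
import zlib
from collections import defaultdict


def partition_of(fields, num_reducers):
    """Choose a partition from fields 1-2 only, mimicking -k1,2."""
    partition_key = "\t".join(fields[:2])
    # crc32 is a deterministic stand-in; Hadoop uses its own hash function,
    # so real partition numbers will differ.
    return zlib.crc32(partition_key.encode("utf-8")) % num_reducers


def sort_key(fields):
    """Order by field 1, field 2, then field 3 as an integer."""
    return (fields[0], fields[1], int(fields[2]))


def simulate(lines, num_reducers=3):
    """Return {partition number: rows sorted the way a reducer would see them}."""
    partitions = defaultdict(list)
    for line in lines:
        fields = line.split()
        partitions[partition_of(fields, num_reducers)].append(fields)
    return {p: sorted(rows, key=sort_key) for p, rows in partitions.items()}


lines = [
    "a b 1 file1", "a b 2 file1", "e f 0 file2", "d c 3 file3",
    "d e 2 file4", "a c 5 file5", "a b 3 file1", "d c 2 file3",
    "e f 2 file2", "a c 4 file5", "d e 10 file4",
]

result = simulate(lines)
for p in sorted(result):
    print("partition", p)
    for row in result[p]:
        print("  " + " ".join(row))
```

Because the partition depends only on the first two fields, every record for a given (column 1, column 2) pair lands in the same partition, while the longer sort key still orders the records inside it. Several pairs may share a partition, which matches the requirement stated in the question.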