How to sort with multiple fields in MapReduce Python Streaming?

I'm having a problem with sorting while using MapReduce with streaming and Python.

This is part of a bigger problem, but it can be reduced (no pun intended :) ) to this:

>> cat inputFile.txt
a       b       1       file1
a       b       2       file1
e       f       0       file2
d       c       3       file3
d       e       2       file4
a       c       5       file5
a       b       3       file1
d       c       2       file3
e       f       2       file2
a       c       4       file5
d       e       10      file4

The first and second columns are the keys.

I'd like the output of the map phase to be sorted this way (first by column 1, then by column 2, and then by column 3 numerically):

>>sort -k1,1 -k2,2 -k3n,3 inputFile.txt
a       b       1       file1
a       b       2       file1
a       b       3       file1
a       c       4       file5
a       c       5       file5
d       c       2       file3
d       c       3       file3
d       e       2       file4
d       e       10      file4
e       f       0       file2
e       f       2       file2
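
For checking a job's output against a small sample, the same ordering can be reproduced locally in Python. This is only a sketch: `sort_rows` is a hypothetical helper name, and it assumes whitespace-separated columns with an integer third column. Python compares strings by code point, which can differ from `sort`'s locale-aware collation on non-ASCII data.

```python
def sort_rows(lines):
    """Sort rows by column 1, then column 2, then column 3 as an integer."""
    rows = [line.split() for line in lines if line.strip()]
    # Tuple keys compare left to right, mirroring -k1,1 -k2,2 -k3n,3.
    return sorted(rows, key=lambda r: (r[0], r[1], int(r[2])))


if __name__ == "__main__":
    sample = [
        "d\te\t10\tfile4",
        "a\tb\t2\tfile1",
        "d\te\t2\tfile4",
        "a\tb\t1\tfile1",
    ]
    for row in sort_rows(sample):
        print("\t".join(row))
```

Converting the third column with `int` is what keeps `10` after `2`; a plain string comparison would put `10` first.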

The fourth column here is a hint on how I'd like the rows to be split into files for the reduce step, but it's OK if two keys end up in the same file (as long as all instances of each key are in a single file).

To achieve this I run the following command:

hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -D stream.num.map.output.key.fields=2 -D mapred.text.key.comparator.options="-k3,3" -D mapred.text.key.partitioner.options="-k3,3" -mapper cat -reducer cat -input /user/hadoop/inputFile.txt -output /user/hadoop/output

The output of this command is not sorted. For example:

>>cat output/part-00066
a       b       2       file1
a       b       3       file1
a       b       1       file1

Remarks:

  • I know that in the above command I used "-k3,3" and not "-k3n,3", but I just wanted to see whether any sort works at first
  • I tried using "-k1,1,-k2,2 -k3n,3" but I got the same result
  • I tried using 3 for the number of key fields, and it yielded a result where the keys are spread across separate files
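
A rough Python sketch of how `stream.num.map.output.key.fields` appears to split each mapper output line may help when reading these remarks. `split_key` is a hypothetical helper, not part of Hadoop, and it assumes the default tab separator: the first N fields become the key and the remainder becomes the value.

```python
def split_key(line, num_key_fields, sep="\t"):
    """Split a mapper output line into (key, value) after num_key_fields fields."""
    parts = line.rstrip("\n").split(sep)
    key = sep.join(parts[:num_key_fields])
    value = sep.join(parts[num_key_fields:])
    return key, value


if __name__ == "__main__":
    line = "a\tb\t1\tfile1"
    for n in (2, 3, 4):
        # Show which fields end up in the key for each setting.
        print(n, split_key(line, n))
```

Under that assumption, a two-field key contains no third field for "-k3,3" to refer to, and a three-field key that is hashed as a whole can send rows sharing columns 1 and 2 to different reducers.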

It feels like I'm missing something really basic. What am I doing wrong here?

Thanks a lot for your help!

After trying almost every possible combination, I've found that this works:

    hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D stream.num.map.output.key.fields=4 \
    -D mapred.text.key.partitioner.options=-k1,2 \
    -D mapred.text.key.comparator.options="-k1,1 -k2,2 -k3n,3" \
    -input /user/hadoop/inputFile.txt \
    -output /user/hadoop/output \
    -mapper cat -reducer cat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Further explanation can be found here:

The key (again, no pun intended :) ) is the use of the KeyFieldBasedPartitioner as the partitioner.
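
To illustrate the idea, here is a small Python simulation of partitioning on columns 1 and 2 and then sorting inside each partition. It is a sketch under stated assumptions, not Hadoop's implementation: `partition_and_sort` is a hypothetical name, and CRC32 stands in for whatever hash Hadoop actually uses, so the partition numbers will not match a real job.

```python
from collections import defaultdict
import zlib


def partition_and_sort(lines, num_reducers):
    """Group rows by a hash of columns 1-2, then sort each group."""
    partitions = defaultdict(list)
    for line in lines:
        cols = line.split()
        # Stand-in hash over the first two columns only (assumption: CRC32).
        part_key = "\t".join(cols[:2]).encode("utf-8")
        partitions[zlib.crc32(part_key) % num_reducers].append(cols)
    for rows in partitions.values():
        # Mirrors the comparator options: col 1, col 2, then col 3 numerically.
        rows.sort(key=lambda r: (r[0], r[1], int(r[2])))
    return dict(partitions)


if __name__ == "__main__":
    sample = ["a b 2 file1", "a b 3 file1", "a b 1 file1", "d e 10 file4", "d e 2 file4"]
    for part, rows in sorted(partition_and_sort(sample, 3).items()):
        print(part, rows)
```

Because only the first two columns feed the hash, every row for a given pair lands in the same partition, while the third column still takes part in the sort.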
