
Map-Reduce/Hadoop sort by integer value (using MRJob)

This is an MRJob implementation of a simple Map-Reduce sort. In beta.py:

from mrjob.job import MRJob

class Beta(MRJob):
    def mapper(self, _, line):
        # Emit the second column as the key and the first column as the value.
        l = line.split(' ')
        yield l[1], l[0]

    def reducer(self, key, val):
        # Keep only the first value seen for each key.
        yield key, [v for v in val][0]


if __name__ == '__main__':
    Beta.run()

I run it using the text:

1 1
2 4
3 8
4 2
4 7
5 5
6 10
7 11

One can run this using:

cat <filename> | python beta.py

Now the issue is that the output is sorted as if the keys were strings (which is probably the case here, since the mapper emits them as strings). The output is:

"1"     "1"
"10"    "6"
"11"    "7"
"2"     "4"
"4"     "2"
"5"     "5"
"7"     "4"
"8"     "3"

The output that I want is:

"1"     "1"
"2"     "4"
"4"     "2"
"5"     "5"
"7"     "4"
"8"     "3"
"10"    "6"
"11"    "7"

I am not sure whether this can be solved by fiddling with protocols in MRJob, since protocols are job-specific rather than step-specific.
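
For what it's worth, protocols in MRJob are indeed set as class attributes on the job, which is why they apply job-wide rather than per step. Below is a minimal sketch (assuming a reasonably recent mrjob version) with the default protocols written out explicitly; protocols only control serialization, so changing them does not by itself change the sort order:

from mrjob.job import MRJob
from mrjob.protocol import RawValueProtocol, JSONProtocol

class Beta(MRJob):
    # Protocols are job-wide class attributes, not per-step settings.
    # These are mrjob's defaults made explicit.
    INPUT_PROTOCOL = RawValueProtocol   # read each input line as a raw string
    INTERNAL_PROTOCOL = JSONProtocol    # mapper -> reducer serialization
    OUTPUT_PROTOCOL = JSONProtocol      # final output as JSON key/value pairs

    def mapper(self, _, line):
        l = line.split(' ')
        yield l[1], l[0]

    def reducer(self, key, values):
        yield key, list(values)[0]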

EDIT (Solution): I have got the answer for this one. The idea is to left-pad every number with '0' bytes so that every key has as many bytes as the largest number; the lexicographic sort of the string keys then matches the numeric order. At least that's what I remember from my classes. I cannot post the answer right now as the site won't permit me, but this is the only solution I've got. If anyone has something more transparent and easy, please share.
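
To see why the padding works, here is a small illustration (not part of the job itself) comparing plain string sorting with sorting of zero-padded keys:

keys = [1, 10, 11, 2, 4, 5, 7, 8]

# Plain string sort compares character by character, so '10' < '2'.
print(sorted(str(k) for k in keys))
# ['1', '10', '11', '2', '4', '5', '7', '8']

# Zero-padded to a fixed width, lexicographic order matches numeric order.
print(sorted('%010d' % k for k in keys))
# ['0000000001', '0000000002', '0000000004', '0000000005',
#  '0000000007', '0000000008', '0000000010', '0000000011']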

Simple solution (a more robust approach might be based on tuning how Hadoop sorts the mapper output; see the sketch after the code below):

from mrjob.job import MRJob

class Beta(MRJob):

    def mapper(self, _, line):
        # Zero-pad the key to a fixed width so string sorting matches numeric order.
        l = line.strip('\n').split()
        yield '%010d' % int(l[1]), l[0]

    def reducer(self, key, values):
        # Convert back to int so the padding does not show up in the output.
        yield int(key), int(list(values)[0])


if __name__ == '__main__':
    Beta.run()
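
As for the more robust route mentioned above, the idea would be to tell Hadoop itself to compare keys numerically instead of padding them. The following is only a hedged sketch: it assumes mrjob's JOBCONF class attribute and Hadoop streaming's KeyFieldBasedComparator, the property names vary between Hadoop versions (older clusters use the mapred.* equivalents), it only takes effect on a Hadoop runner (the local/inline runners still sort serialized strings), and the default JSON internal protocol writes keys with surrounding quotes, which the numeric comparator may not parse as expected, so the internal protocol may also need attention:

from mrjob.job import MRJob

class BetaNumericSort(MRJob):
    # Ask Hadoop to compare map output keys numerically.  New-style property
    # names; older Hadoop versions use mapred.output.key.comparator.class and
    # mapred.text.key.comparator.options instead.
    JOBCONF = {
        'mapreduce.job.output.key.comparator.class':
            'org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator',
        'mapreduce.partition.keycomparator.options': '-n',
    }

    def mapper(self, _, line):
        l = line.split()
        yield l[1], l[0]

    def reducer(self, key, values):
        yield key, list(values)[0]


if __name__ == '__main__':
    BetaNumericSort.run()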
