
Map-Reduce/Hadoop sort by integer value (using MRJob)

This is an MRJob implementation of a simple Map-Reduce sort. In beta.py:

from mrjob.job import MRJob

class Beta(MRJob):
    def mapper(self, _, line):
        # Emit the second column as the key and the first column as the value.
        l = line.split(' ')
        yield l[1], l[0]

    def reducer(self, key, val):
        # Keep only the first value seen for each key.
        yield key, [v for v in val][0]


if __name__ == '__main__':
    Beta.run()

I run it using the text:

1 1
2 4
3 8
4 2
4 7
5 5
6 10
7 11

One can run this using:

cat <filename> | python beta.py

Now the issue is that the output is sorted as if the keys were strings (which is probably the case here, since the mapper emits them as strings). The output is:

"1"     "1"
"10"    "6"
"11"    "7"
"2"     "4"
"4"     "2"
"5"     "5"
"7"     "4"
"8"     "3"

The output that I want is:

"1"     "1"
"2"     "4"
"4"     "2"
"5"     "5"
"7"     "4"
"8"     "3"
"10"    "6"
"11"    "7"

I am not sure whether this can be solved by fiddling with protocols in MRJob, since protocols are job-specific rather than step-specific.
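
For what it's worth, protocols in MRJob are indeed set as class attributes on the job, which is why they apply job-wide rather than per step. Below is a minimal sketch (assuming a reasonably recent mrjob version) with the default protocols written out explicitly; protocols only control serialization, so changing them does not by itself change the sort order:

from mrjob.job import MRJob
from mrjob.protocol import RawValueProtocol, JSONProtocol

class Beta(MRJob):
    # Protocols are job-wide class attributes, not per-step settings.
    # These are mrjob's defaults made explicit.
    INPUT_PROTOCOL = RawValueProtocol   # read each input line as a raw string
    INTERNAL_PROTOCOL = JSONProtocol    # mapper -> reducer serialization
    OUTPUT_PROTOCOL = JSONProtocol      # final output as JSON key/value pairs

    def mapper(self, _, line):
        l = line.split(' ')
        yield l[1], l[0]

    def reducer(self, key, values):
        yield key, list(values)[0]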

EDIT (Solution): I have got the answer for this one. The idea is to left-pad every number with '0' bytes so that every key has as many bytes as the largest number; the lexicographic sort of the string keys then matches the numeric order. At least that's what I remember from my classes. I cannot post the answer right now as the site won't permit me, but this is the only solution I've got. If anyone has something more transparent and easy, please share.
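
To see why the padding works, here is a small illustration (not part of the job itself) comparing plain string sorting with sorting of zero-padded keys:

keys = [1, 10, 11, 2, 4, 5, 7, 8]

# Plain string sort compares character by character, so '10' < '2'.
print(sorted(str(k) for k in keys))
# ['1', '10', '11', '2', '4', '5', '7', '8']

# Zero-padded to a fixed width, lexicographic order matches numeric order.
print(sorted('%010d' % k for k in keys))
# ['0000000001', '0000000002', '0000000004', '0000000005',
#  '0000000007', '0000000008', '0000000010', '0000000011']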

Simple solution (a more robust approach might be based on tuning how Hadoop sorts the mapper output; see the sketch after the code below):

from mrjob.job import MRJob

class Beta(MRJob):

    def mapper(self, _, line):
        # Zero-pad the key to a fixed width so string sorting matches numeric order.
        l = line.strip('\n').split()
        yield '%010d' % int(l[1]), l[0]

    def reducer(self, key, values):
        # Convert back to int so the padding does not show up in the output.
        yield int(key), int(list(values)[0])


if __name__ == '__main__':
    Beta.run()
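
As for the more robust route mentioned above, the idea would be to tell Hadoop itself to compare keys numerically instead of padding them. The following is only a hedged sketch: it assumes mrjob's JOBCONF class attribute and Hadoop streaming's KeyFieldBasedComparator, the property names vary between Hadoop versions (older clusters use the mapred.* equivalents), it only takes effect on a Hadoop runner (the local/inline runners still sort serialized strings), and the default JSON internal protocol writes keys with surrounding quotes, which the numeric comparator may not parse as expected, so the internal protocol may also need attention:

from mrjob.job import MRJob

class BetaNumericSort(MRJob):
    # Ask Hadoop to compare map output keys numerically.  New-style property
    # names; older Hadoop versions use mapred.output.key.comparator.class and
    # mapred.text.key.comparator.options instead.
    JOBCONF = {
        'mapreduce.job.output.key.comparator.class':
            'org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator',
        'mapreduce.partition.keycomparator.options': '-n',
    }

    def mapper(self, _, line):
        l = line.split()
        yield l[1], l[0]

    def reducer(self, key, values):
        yield key, list(values)[0]


if __name__ == '__main__':
    BetaNumericSort.run()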
