
MRJob sort reducer output

Is there any way to sort the output of reducer function using mrjob?

I think the input to the reducer function is sorted by key, and I tried to exploit this to sort the output with a second reducer, as shown below. The values are numeric; I want to count the occurrences of each key and sort the keys by that count:

def mapper_1(self, key, line):
    key = ...  # extract key from the line
    yield (key, 1)

def reducer_1(self, key, values):
    yield key, sum(values)

def mapper_2(self, key, count):
    yield ('%020d' % int(count), key)

def reducer_2(self, count, keys):
    for key in keys:
        yield key, int(count)

But its output is not correctly sorted! I suspected that this weird behavior is due to handling ints as strings, and I tried to format them as this link says, but it didn't work!

IMPORTANT NOTE: When I use the debugger to see the order of output of reducer_2 the order is correct but what is printed as output is something else!!!

IMPORTANT NOTE 2: On another computer the same program on the same data returns output sorted as expected!
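The suspicion about integers being handled as strings can be checked in isolation, without mrjob. The sketch below is only an illustration with made-up counts (the `counts` list and the helper name `padded_sort_matches_numeric` are assumptions, not part of the original job). It shows that plain string sorting orders numbers lexicographically, while the `'%020d'` padding used in `mapper_2` restores numeric order for non-negative integers.

```python
# Made-up counts, used only for illustration.
counts = [9, 10, 100, 2]

# Plain strings compare character by character, so "10" sorts before "2".
as_plain_strings = sorted(str(c) for c in counts)

# Zero-padded strings all have the same width, so lexicographic order
# agrees with numeric order (for non-negative integers).
as_padded_strings = sorted('%020d' % c for c in counts)


def padded_sort_matches_numeric(values):
    """Return True if sorting zero-padded strings gives numeric order."""
    padded = sorted('%020d' % v for v in values)
    return [int(p) for p in padded] == sorted(values)


if __name__ == '__main__':
    print(as_plain_strings)
    print([int(p) for p in as_padded_strings])
```

If the padded keys sort correctly here, the formatting in `mapper_2` is probably not what breaks the order.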

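MapReduce sorts keys only within each reducer task, so the combined output is globally ordered only when a single reducer sees every pair. You can emit all (count, word) pairs under one key, sort them as integers in the second reducer, and then convert the counts to the zero-padded representation: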

import re

from mrjob.job import MRJob
from mrjob.step import MRStep

WORD_RE = re.compile(r"[\w']+")


class MRWordFrequencyCount(MRJob):

    def steps(self):
        return [
            MRStep(
                mapper=self.mapper_extract_words, combiner=self.combine_word_counts,
                reducer=self.reducer_sum_word_counts
            ),
            MRStep(
                reducer=self.reduce_sort_counts
            )
        ]

    def mapper_extract_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combine_word_counts(self, word, counts):
        yield word, sum(counts)

    def reducer_sum_word_counts(self, key, values):
        yield None, (sum(values), key)

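    def reduce_sort_counts(self, _, word_counts):
        # All pairs arrive under the single key None, so they can be sorted together.
        for count, key in sorted(word_counts, reverse=True):
            yield ('%020d' % int(count), key)


if __name__ == '__main__':
    MRWordFrequencyCount.run()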

Note that this sorts the output in memory, which might be a problem depending on the size of the input. But you want it sorted, so it has to be sorted somehow.
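One plausible explanation for the unsorted output in the question, which this thread does not confirm, is that the second step ran with more than one reducer task. The plain-Python sketch below simulates that situation. The `partition` function is a hypothetical stand-in, not mrjob's or Hadoop's actual partitioner, and `run_reduce_phase` is an illustrative name. Each simulated reducer sorts its own keys, and the results are concatenated in task order.

```python
def partition(key, num_reducers):
    # Hypothetical partitioner: the real one hashes the key, but any
    # function that spreads keys across tasks shows the same effect.
    return int(key) % num_reducers


def run_reduce_phase(pairs, num_reducers):
    """Simulate a shuffle: sort within each reducer, then concatenate."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    output = []
    for bucket in buckets:
        # Each reducer task sees only its own keys, in sorted order.
        output.extend(sorted(bucket))
    return output


if __name__ == '__main__':
    # Made-up (word, count) data, keyed by zero-padded count as in mapper_2.
    word_counts = [('a', 3), ('b', 1), ('c', 4), ('d', 2)]
    pairs = [('%020d' % count, word) for word, count in word_counts]
    for n in (1, 2):
        print(n, [int(key) for key, _ in run_reduce_phase(pairs, n)])
```

With one simulated reducer the counts come out in order; with two, each half is sorted but the concatenation is not. Routing everything through the single key `None`, as the answer does, has the same effect as the one-reducer case.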
