MRJob sort reducer output

Question

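Is there any way to sort the output of a reducer function using mrjob?

I think the input to the reducer function is sorted by key, and I tried to exploit this by using a second reducer to sort the output, as shown below. I know the values are numeric; I want to count the occurrences of each key and sort the keys by that count: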

def mapper_1(self, key, line):
    key = #extract key from the line
    yield (key, 1)

def reducer_1(self, key, values):
    yield key, sum(values)

def mapper_2(self, key, count):
    yield ('%020d' % int(count), key)

def reducer_2(self, count, keys):
    for key in keys:
        yield key, int(count)

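But the output is not sorted correctly! I suspected this weird behavior was caused by handling ints as strings, so I tried zero-padding them as this link suggests, but that didn't help!

IMPORTANT NOTE: When I use the debugger to inspect the order of reducer_2's output, the order is correct, but what is printed as the output is something else!!!

IMPORTANT NOTE 2: On another computer, the same program on the same data returns output sorted as expected!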
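For reference, a minimal plain-Python sketch (no mrjob involved) of why mapper_2 zero-pads the count: string keys compare character by character, so unpadded numbers sort in the wrong order, while fixed-width zero-padded strings sort like the integers they represent. The sample counts are made up for illustration.

```python
# Hypothetical sample counts, used only to illustrate the ordering.
counts = [5, 40, 300, 12]

# Sorting the plain string forms compares character by character.
plain_sorted = sorted(str(c) for c in counts)

# Zero-padding to a fixed width makes string order match numeric order.
padded_sorted = sorted('%020d' % c for c in counts)

print(plain_sorted)                     # lexicographic order
print([int(p) for p in padded_sorted])  # numeric order
```

This only shows that the padding itself behaves as intended; it does not explain the difference between the two machines.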

Answer 1

You can sort the values as integers in the second reducer and then convert them to the zero-padded representation:

import re

from mrjob.job import MRJob
from mrjob.step import MRStep

WORD_RE = re.compile(r"[\w']+")


class MRWordFrequencyCount(MRJob):

    def steps(self):
        return [
            MRStep(
                mapper=self.mapper_extract_words, combiner=self.combine_word_counts,
                reducer=self.reducer_sum_word_counts
            ),
            MRStep(
                reducer=self.reduce_sort_counts
            )
        ]

    def mapper_extract_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combine_word_counts(self, word, counts):
        yield word, sum(counts)

    def reducer_sum_word_counts(self, key, values):
        yield None, (sum(values), key)

    def reduce_sort_counts(self, _, word_counts):
        for count, key in sorted(word_counts, reverse=True):
            yield ('%020d' % int(count), key)

Granted, this sorts the output in memory, which may become a problem depending on the size of the input. But you want it sorted, so it has to be sorted somehow.
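If only the most frequent entries are needed rather than the full sorted list, the memory concern can be reduced. The following is a sketch under that assumption, not part of the job above: it uses the standard library's heapq.nlargest, which consumes an iterator while keeping only n items at a time. The helper name top_n_counts and the sample pairs are hypothetical.

```python
import heapq


def top_n_counts(word_counts, n):
    """Return the n largest (count, word) pairs, largest first.

    word_counts may be any iterable (for example, the values iterator a
    reducer receives); only n pairs are held in memory at once.
    """
    return heapq.nlargest(n, word_counts)


# Hypothetical sample data standing in for the reducer's input values.
pairs = [(3, 'the'), (1, 'fox'), (7, 'a'), (2, 'dog')]
top_two = top_n_counts(iter(pairs), 2)
print(top_two)
```

Inside reduce_sort_counts, the same call could replace sorted(word_counts, reverse=True) when a top-n result is enough; a full ordering of all keys still requires sorting everything.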
