MRJob sort reducer output
Is there any way to sort the output of a reducer function using mrjob?
I think the input to a reducer function is sorted by key, and I tried to exploit this to sort the output with a second reducer, as shown below. I know the values are numeric; I want to count the occurrences of each key and sort the keys by that count:
def mapper_1(self, key, line):
    key = ...  # extract key from the line
    yield (key, 1)

def reducer_1(self, key, values):
    yield key, sum(values)

def mapper_2(self, key, count):
    # zero-pad the count so that string order matches numeric order
    yield ('%020d' % int(count), key)

def reducer_2(self, count, keys):
    for key in keys:
        yield key, int(count)
But its output is not correctly sorted! I suspected that this weird behavior was due to manipulating ints as strings, and tried to format them as this link says, but it didn't work!
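To make the zero-padding idea concrete, here is a minimal sketch in plain Python (no mrjob involved). The sample counts are made up for illustration; the point is only that fixed-width strings sort in the same order as the integers they represent, while unpadded strings do not.

```python
counts = [9, 10, 100, 2]  # hypothetical sample counts

# Plain decimal strings sort lexicographically, so "10" comes before "9".
as_strings = sorted(str(c) for c in counts)

# Zero-padding to a fixed width makes lexicographic order match numeric order.
padded = sorted('%020d' % c for c in counts)
padded_as_ints = [int(p) for p in padded]

print(as_strings)
print(padded_as_ints)
```

This only shows that the key format itself is sound; it says nothing about how the framework orders or merges the reducer output.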
IMPORTANT NOTE: When I use the debugger to see the order of the output of reducer_2, the order is correct, but what is printed as output is something else!
IMPORTANT NOTE 2: On another computer, the same program on the same data returns output sorted as expected!
You can sort the values as integers in the second reducer and then convert them into the zero-padded representation:
import re

from mrjob.job import MRJob
from mrjob.step import MRStep

WORD_RE = re.compile(r"[\w']+")


class MRWordFrequencyCount(MRJob):

    def steps(self):
        return [
            MRStep(
                mapper=self.mapper_extract_words,
                combiner=self.combine_word_counts,
                reducer=self.reducer_sum_word_counts
            ),
            MRStep(
                reducer=self.reduce_sort_counts
            )
        ]

    def mapper_extract_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combine_word_counts(self, word, counts):
        yield word, sum(counts)

    def reducer_sum_word_counts(self, key, values):
        # emit everything under a single key (None) so that one reducer
        # call in the next step sees all (count, word) pairs
        yield None, (sum(values), key)

    def reduce_sort_counts(self, _, word_counts):
        # sort all pairs in memory, highest count first
        for count, key in sorted(word_counts, reverse=True):
            yield ('%020d' % int(count), key)
Well, this sorts the output in memory, which might be a problem depending on the size of the input. But you want it sorted, so it has to be sorted somehow.
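For intuition, the two steps can be mimicked locally in plain Python, without mrjob. This is only a sketch: the helper name `sorted_word_counts` and the sample lines are made up for illustration, and the whole list of pairs is held in memory, just as in `reduce_sort_counts`.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")


def sorted_word_counts(lines):
    """Hypothetical local stand-in for the two steps: count, then sort."""
    # Step 1: count words (what the mapper, combiner and first reducer do).
    counts = Counter()
    for line in lines:
        for word in WORD_RE.findall(line):
            counts[word.lower()] += 1

    # Step 2: gather every (count, word) pair in one place and sort in
    # memory, highest count first, then zero-pad the count for output.
    pairs = [(count, word) for word, count in counts.items()]
    return [('%020d' % count, word)
            for count, word in sorted(pairs, reverse=True)]


result = sorted_word_counts(["a b a", "b a c"])  # made-up sample input
for padded_count, word in result:
    print(padded_count, word)
```

Because the pairs are compared as `(count, word)` tuples, words with equal counts come out in reverse alphabetical order.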