
Groupby and list comprehension headache in python

I've got this from a Hadoop tutorial. It is a reducer that basically takes in (word, count) pairs from stdin and sums them.

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            pass

Now, I want to be able to take in tuples (word, count1, count2), but this groupby and sum(int(count) for current_word, count in group) business is completely illegible to me. How do I modify this chunk so it basically continues doing what it does right now, but with a second counter value? I.e. input is (word, count1, count2) and output is (word, count1, count2).

EDIT 1:

from itertools import groupby, izip
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 2)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            counts_a, counts_b = izip((int(count_a), int(count_b)) for current_word, count_a, count_b in group)
            t1, t2 = sum(counts_a), sum(counts_b)
            print "%s%s%d%s%d" % (current_word, separator, t1, separator, t2)
        except ValueError:
            pass

This is a Hadoop job, so the output goes like this:

11/11/23 18:44:21 INFO streaming.StreamJob:  map 100%  reduce 0%
11/11/23 18:44:30 INFO streaming.StreamJob:  map 100%  reduce 17%
11/11/23 18:44:33 INFO streaming.StreamJob:  map 100%  reduce 2%
11/11/23 18:44:42 INFO streaming.StreamJob:  map 100%  reduce 12%
11/11/23 18:44:45 INFO streaming.StreamJob:  map 100%  reduce 0%
11/11/23 18:44:51 INFO streaming.StreamJob:  map 100%  reduce 3%
11/11/23 18:44:54 INFO streaming.StreamJob:  map 100%  reduce 7%
11/11/23 18:44:57 INFO streaming.StreamJob:  map 100%  reduce 0%
11/11/23 18:45:05 INFO streaming.StreamJob:  map 100%  reduce 2%
11/11/23 18:45:06 INFO streaming.StreamJob:  map 100%  reduce 8%
11/11/23 18:45:08 INFO streaming.StreamJob:  map 100%  reduce 7%
11/11/23 18:45:09 INFO streaming.StreamJob:  map 100%  reduce 3%
11/11/23 18:45:12 INFO streaming.StreamJob:  map 100%  reduce 100%
...
11/11/23 18:45:12 ERROR streaming.StreamJob: Job not Successful!

From the logs:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
    at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:473)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

groupby

That is the groupby function from the itertools module, documented here. The data is "grouped by" the results of applying itemgetter(0) (an instance of the itemgetter class from the operator module, documented here) to each element. It returns pairs of (key result, iterator-over-elements-with-that-key). So, each time through the loop, current_word is the "word" that's common to a bunch of data lines (the index-0, i.e. first, item, as extracted by the itemgetter), and group is an iterator over the data lines that start with that word. As described in the documentation for your code, each line of the file has two words: an actual "word" and a count (text intended to be interpreted as a number).
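To make that concrete, here is a minimal standalone sketch (the sample data is made up) showing what groupby(data, itemgetter(0)) yields:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample of (word, count) pairs, already sorted by word --
# groupby only merges *consecutive* items with equal keys, which is safe
# here because Hadoop hands the reducer its input sorted by key.
data = [('apple', '1'), ('apple', '3'), ('banana', '2')]

# Each iteration gives the shared key and an iterator over its lines.
grouped = [(word, list(group)) for word, group in groupby(data, itemgetter(0))]
# grouped == [('apple', [('apple', '1'), ('apple', '3')]),
#             ('banana', [('banana', '2')])]
```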

sum(int(count) for current_word, count in group)

That means exactly what it says: the sum of the integer value of the count, for each (current_word, count) pair found in the group. Each group is a set of lines from the data, as described above. So we take all the lines that started with the current_word, convert their string count values to integers, and add them up.
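As a standalone sketch, again with made-up data:

```python
from itertools import groupby
from operator import itemgetter

# Made-up mapper output: (word, count) pairs, sorted by word.
data = [('apple', '1'), ('apple', '3'), ('banana', '2')]

totals = {}
for current_word, group in groupby(data, itemgetter(0)):
    # Convert each grouped line's string count to int and add them up.
    totals[current_word] = sum(int(count) for current_word, count in group)
# totals == {'apple': 4, 'banana': 2}
```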

How do I modify this chunk so it basically continues doing what it does right now, but with a second counter value? I.e. input is (word, count1, count2) and output is (word, count1, count2).

Well, what do you want each count to represent, and where do you want the data to come from?

I'm going to take what I think is the simplest interpretation: that you're going to modify the data file to have three items on each line, and you're going to take sums from each column of numbers separately.

The groupby will be the same, because we're still grouping lines that we get in the same way, and we're still grouping them according to the "word".

The sum part will need to calculate two values: the sum for the first column of numbers and the sum for the second column of numbers.

When we iterate over group, we'll get sets of three values, so we want to unpack them into three values: current_word, count_a, count_b for example. For each of these, we want to apply the integer conversion to both numbers on each line. That gives us a sequence-of-pairs-of-numbers; if we want to add all the first numbers and all the second numbers, then we should make a pair-of-sequences-of-numbers instead. To do that, we can use another itertools function called izip, applying it with the * argument-unpacking syntax so that each pair becomes a separate argument (without the *, izip would receive a single iterable and would not transpose anything). We can then sum each of those separately, by unpacking them again into two separate sequence-of-numbers variables, and summing them.

Thus:

counts_a, counts_b = izip(
    *((int(count_a), int(count_b)) for current_word, count_a, count_b in group)
)
total_a, total_b = sum(counts_a), sum(counts_b)

Or we could just make a pair-of-counts by doing the same (x for y in z) trick again:

totals = (
    sum(counts)
    for counts in izip(
        *((int(count_a), int(count_b)) for current_word, count_a, count_b in group)
    )
)

Although that result will be somewhat trickier to use within a print statement :)
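For anyone on Python 3, where itertools.izip is gone (the built-in zip does the same job), the whole two-count reducer can be sketched end to end; the input lines and the helper name summarize are made up for illustration. Note the * before the inner generator, which unpacks each line's pair of ints into a separate argument so that zip transposes rows into columns:

```python
import io
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 2)

def summarize(file, separator='\t'):
    # Collect (word, total1, total2) triples instead of printing,
    # so the result is easy to inspect.
    results = []
    data = read_mapper_output(file, separator=separator)
    for current_word, group in groupby(data, itemgetter(0)):
        counts_a, counts_b = zip(
            *((int(count_a), int(count_b))
              for current_word, count_a, count_b in group)
        )
        results.append((current_word, sum(counts_a), sum(counts_b)))
    return results

# Hypothetical mapper output: word, count1, count2 per line, sorted by word.
sample = io.StringIO('foo\t1\t10\nfoo\t2\t20\nbar\t3\t30\n')
# summarize(sample) == [('foo', 3, 30), ('bar', 3, 30)]
```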

import sys
from collections import defaultdict

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    counts = defaultdict(lambda: [0, 0])
    for word, count1, count2 in data:
        values = counts[word]
        values[0] += int(count1)
        values[1] += int(count2)

    for word, (count1, count2) in counts.iteritems():
        print('{0}\t{1}\t{2}'.format(word, count1, count2))
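The defaultdict approach trades groupby's reliance on sorted input for a dict held in memory: order no longer matters, because each word keeps its own running pair of totals. A quick self-contained check with made-up rows (the helper name tally is mine; it uses Python 3's items() and sorts the dict for a deterministic result):

```python
from collections import defaultdict

def tally(rows):
    # rows are (word, count1, count2) triples of strings, in any order.
    counts = defaultdict(lambda: [0, 0])
    for word, count1, count2 in rows:
        values = counts[word]
        values[0] += int(count1)
        values[1] += int(count2)
    return sorted(counts.items())

rows = [('foo', '1', '10'), ('bar', '3', '30'), ('foo', '2', '20')]
# tally(rows) == [('bar', [3, 30]), ('foo', [3, 30])]
```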
