
Groupby and list comprehension headache in python

I've got this from a Hadoop tutorial. It is a reducer that basically takes in (word, count) pairs from stdin and sums them.

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            pass

Now, I want to be able to take in tuples (word, count1, count2), but this groupby and sum(int(count) for current_word, count in group) business is completely illegible to me. How do I modify this chunk so it basically continues doing what it does right now, but with a second counter value? I.e. input is (word, count1, count2) and output is (word, count1, count2).

EDIT 1:

from itertools import groupby, izip
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 2)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            counts_a, counts_b = izip((int(count_a), int(count_b)) for current_word, count_a, count_b in group)
            t1, t2 = sum(counts_a), sum(counts_b)
            print "%s%s%d%s%d" % (current_word, separator, t1, separator, t2)
        except ValueError:
            pass

This is a Hadoop job, so the output goes like this:

11/11/23 18:44:21 INFO streaming.StreamJob:  map 100%  reduce 0%
11/11/23 18:44:30 INFO streaming.StreamJob:  map 100%  reduce 17%
11/11/23 18:44:33 INFO streaming.StreamJob:  map 100%  reduce 2%
11/11/23 18:44:42 INFO streaming.StreamJob:  map 100%  reduce 12%
11/11/23 18:44:45 INFO streaming.StreamJob:  map 100%  reduce 0%
11/11/23 18:44:51 INFO streaming.StreamJob:  map 100%  reduce 3%
11/11/23 18:44:54 INFO streaming.StreamJob:  map 100%  reduce 7%
11/11/23 18:44:57 INFO streaming.StreamJob:  map 100%  reduce 0%
11/11/23 18:45:05 INFO streaming.StreamJob:  map 100%  reduce 2%
11/11/23 18:45:06 INFO streaming.StreamJob:  map 100%  reduce 8%
11/11/23 18:45:08 INFO streaming.StreamJob:  map 100%  reduce 7%
11/11/23 18:45:09 INFO streaming.StreamJob:  map 100%  reduce 3%
11/11/23 18:45:12 INFO streaming.StreamJob:  map 100%  reduce 100%
...
11/11/23 18:45:12 ERROR streaming.StreamJob: Job not Successful!

From the logs:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
    at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:473)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
    at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:473)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

groupby

That is the groupby function from the itertools module. The data is "grouped by" the result of applying itemgetter(0) (an instance of the itemgetter class from the operator module) to each element. It yields pairs of (key, iterator-over-elements-with-that-key). So, each time through the loop, current_word is the "word" common to a run of consecutive data lines (the index-0, i.e. first, item, as extracted by the itemgetter), and group is an iterator over the data lines that start with that word. Note that groupby only merges consecutive items, which is fine here because Hadoop hands the reducer its keys in sorted order. As described in the documentation for your code, each line of the file has two fields: an actual "word" and a count (text intended to be interpreted as a number).
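As a quick illustration, on made-up pairs (not data from your job), groupby with itemgetter(0) behaves like this:

```python
from itertools import groupby
from operator import itemgetter

# Pairs as the reducer would see them: already sorted by word.
data = [("apple", "1"), ("apple", "3"), ("pear", "2")]

for word, group in groupby(data, itemgetter(0)):
    # group is an iterator over the consecutive pairs sharing this word
    print(word, list(group))
# apple [('apple', '1'), ('apple', '3')]
# pear [('pear', '2')]
```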

sum(int(count) for current_word, count in group)

That means exactly what it says: the sum of the integer value of the count, for each (current_word, count) pair found in the group. Each group is a set of lines from the data, as described above. So we take all the lines that started with the current_word, convert their string count values to integers, and add them up.
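In isolation, with a made-up group, that sum looks like this:

```python
# A hypothetical group for the word "apple": counts arrive as strings.
group = [("apple", "1"), ("apple", "3")]

total_count = sum(int(count) for current_word, count in group)
print(total_count)  # 4
```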

How do I modify this chunk so it basically continues doing what it does right now, but with a second counter value? Ie input is (word, count1, count2) and output is (word, count1, count2).

Well, what do you want each count to represent, and where do you want the data to come from?

I'm going to take what I think is the simplest interpretation: that you're going to modify the data file to have three items on each line, and you're going to take sums from each column of numbers separately.

The groupby will be the same, because we're still grouping lines that we get in the same way, and we're still grouping them according to the "word".

The sum part will need to calculate two values: the sum for the first column of numbers and the sum for the second column of numbers.

When we iterate over group, we'll get tuples of three values, so we want to unpack them into three names: current_word, count_a, count_b for example. For each line, we apply the integer conversion to both numbers. That gives us a sequence-of-pairs-of-numbers; if we want to add all the first numbers and all the second numbers, we should transpose it into a pair-of-sequences-of-numbers instead. To do that, we can use another itertools function called izip, applying it with the * operator so that each pair becomes a separate argument. We can then unpack the result into two separate sequence-of-numbers variables and sum each of them.

Thus:

counts_a, counts_b = izip(
    *((int(count_a), int(count_b)) for current_word, count_a, count_b in group)
)
total_a, total_b = sum(counts_a), sum(counts_b)
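The transposition trick here is izip(*pairs) (plain zip in Python 3): with the * operator, each pair is passed as a separate argument, turning a sequence of pairs into a pair of sequences. A sketch on made-up rows:

```python
# Hypothetical rows for one group; zip here plays the role of
# Python 2's itertools.izip.
rows = [("word", "1", "10"), ("word", "2", "20")]

counts_a, counts_b = zip(*((int(a), int(b)) for _word, a, b in rows))
print(counts_a, counts_b)            # (1, 2) (10, 20)
print(sum(counts_a), sum(counts_b))  # 3 30
```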

Or we could just make a pair-of-counts by doing the same (x for y in z) trick again:

totals = (
    sum(counts)
    for counts in izip(
        *((int(count_a), int(count_b)) for current_word, count_a, count_b in group)
    )
)

Although that result will be somewhat trickier to use within a print statement :)
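For instance, on made-up rows, you can unpack the two totals from the generator before printing:

```python
# Hypothetical rows for one group; zip stands in for Python 2's izip.
rows = [("word", "1", "10"), ("word", "2", "20")]

totals = (
    sum(counts)
    for counts in zip(*((int(a), int(b)) for _word, a, b in rows))
)
t1, t2 = totals  # forces the generator; works because there are exactly two columns
print("%s\t%d\t%d" % ("word", t1, t2))  # prints "word\t3\t30"
```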

from collections import defaultdict

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    counts = defaultdict(lambda: [0, 0])
    for word, count1, count2 in data:
        values = counts[word]
        values[0] += int(count1)
        values[1] += int(count2)

    for word, (count1, count2) in counts.iteritems():
        print('{0}\t{1}\t{2}'.format(word, count1, count2))
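A quick check of that accumulation on fabricated input lines, using io.StringIO to stand in for sys.stdin (on Python 3, counts.items() replaces iteritems()):

```python
import io
from collections import defaultdict

# Fabricated reducer input: word<TAB>count1<TAB>count2 per line.
stdin = io.StringIO("apple\t1\t10\napple\t2\t20\npear\t3\t30\n")

counts = defaultdict(lambda: [0, 0])
for line in stdin:
    word, count1, count2 = line.rstrip().split("\t", 2)
    counts[word][0] += int(count1)
    counts[word][1] += int(count2)

for word, (count1, count2) in sorted(counts.items()):
    print('{0}\t{1}\t{2}'.format(word, count1, count2))
# prints "apple\t3\t30" then "pear\t3\t30"
```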
