I've got this from a Hadoop tutorial. It's a reducer that reads (word, count) pairs from stdin and sums the counts for each word.
import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            pass
Now, I want to be able to take in tuples (word, count1, count2), but this groupby and sum(int(count) for current_word, count in group) business is completely illegible to me. How do I modify this chunk so it basically continues doing what it does right now, but with a second counter value? I.e. input is (word, count1, count2) and output is (word, count1, count2).
EDIT 1:
from itertools import groupby, izip
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 2)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            counts_a, counts_b = izip((int(count_a), int(count_b)) for current_word, count_a, count_b in group)
            t1, t2 = sum(counts_a), sum(counts_b)
            print "%s%s%d%s%d" % (current_word, separator, t1, separator, t2)
        except ValueError:
            pass
This is a Hadoop job, so the output goes like this:
11/11/23 18:44:21 INFO streaming.StreamJob: map 100% reduce 0%
11/11/23 18:44:30 INFO streaming.StreamJob: map 100% reduce 17%
11/11/23 18:44:33 INFO streaming.StreamJob: map 100% reduce 2%
11/11/23 18:44:42 INFO streaming.StreamJob: map 100% reduce 12%
11/11/23 18:44:45 INFO streaming.StreamJob: map 100% reduce 0%
11/11/23 18:44:51 INFO streaming.StreamJob: map 100% reduce 3%
11/11/23 18:44:54 INFO streaming.StreamJob: map 100% reduce 7%
11/11/23 18:44:57 INFO streaming.StreamJob: map 100% reduce 0%
11/11/23 18:45:05 INFO streaming.StreamJob: map 100% reduce 2%
11/11/23 18:45:06 INFO streaming.StreamJob: map 100% reduce 8%
11/11/23 18:45:08 INFO streaming.StreamJob: map 100% reduce 7%
11/11/23 18:45:09 INFO streaming.StreamJob: map 100% reduce 3%
11/11/23 18:45:12 INFO streaming.StreamJob: map 100% reduce 100%
...
11/11/23 18:45:12 ERROR streaming.StreamJob: Job not Successful!
From the logs:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:473)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:473)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
groupby

That is the groupby function from the itertools module. The data is "grouped by" the results of applying itemgetter(0) (an instance of the itemgetter class from the operator module) to each element. groupby returns pairs of (key result, iterator-over-elements-with-that-key). So, each time through the loop, current_word is the "word" that's common to a run of data lines (the index-0, i.e. first, item, as extracted by the itemgetter), and group is an iterator over the data lines that start with that word. As described in the documentation for your code, each line of the file has two fields: an actual "word" and a count (text intended to be interpreted as a number).

sum(int(count) for current_word, count in group)

That means exactly what it says: the sum of the integer value of the count, for each (current_word, count) pair found in the group. Each group is a set of lines from the data, as described above. So we take all the lines that started with the current_word, convert their string count values to integers, and add them up.
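To make that concrete, here is a minimal standalone sketch of the same groupby + itemgetter + sum pattern (the sample words and counts are made up, and Python 3's print function is used):

```python
from itertools import groupby
from operator import itemgetter

# Sample (word, count) pairs, already sorted by word -- Hadoop
# guarantees sorted reducer input, and groupby only merges
# adjacent elements with equal keys.
data = [("apple", "3"), ("apple", "2"), ("banana", "5")]

totals = {}
for current_word, group in groupby(data, itemgetter(0)):
    # Sum the integer value of count for every pair in this group.
    totals[current_word] = sum(int(count) for current_word, count in group)

print(totals)  # {'apple': 5, 'banana': 5}
```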
How do I modify this chunk so it basically continues doing what it does right now, but with a second counter value? I.e. input is (word, count1, count2) and output is (word, count1, count2).
Well, what do you want each count to represent, and where do you want the data to come from?
I'm going to take what I think is the simplest interpretation: that you're going to modify the data file to have three items on each line, and you're going to take sums from each column of numbers separately.
The groupby will be the same, because we're still grouping the lines we get in the same way, and we're still grouping them according to the "word". The sum part will need to calculate two values: the sum for the first column of numbers and the sum for the second column of numbers.

When we iterate over group, we'll get sets of three values, so we want to unpack them into three values: current_word, count_a, count_b, for example. For each of those lines, we want to apply the integer conversion to both numbers. That gives us a sequence-of-pairs-of-numbers; since we want to add all the first numbers and all the second numbers, we should make a pair-of-sequences-of-numbers instead. To do that, we can use another itertools function called izip. We can then sum each of those separately, by unpacking them again into two separate sequence-of-numbers variables, and summing each one.
Thus:
counts_a, counts_b = izip(
    *((int(count_a), int(count_b)) for current_word, count_a, count_b in group)
)
total_a, total_b = sum(counts_a), sum(counts_b)
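The transposition step is easiest to see in isolation. A tiny sketch with made-up numbers, using Python 3's built-in zip (the equivalent of Python 2's izip here) and the * that splats the sequence of pairs into zip as separate arguments:

```python
# zip(*iterable_of_pairs) transposes rows into columns: the pairs
# (1, 10), (2, 20), (3, 30) become (1, 2, 3) and (10, 20, 30).
pairs = [(1, 10), (2, 20), (3, 30)]
counts_a, counts_b = zip(*pairs)
total_a, total_b = sum(counts_a), sum(counts_b)
print(total_a, total_b)  # 6 60
```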
Or we could just make a pair-of-totals by doing the same (x for y in z) trick again:

totals = (
    sum(counts)
    for counts in izip(
        *((int(count_a), int(count_b)) for current_word, count_a, count_b in group)
    )
)
Although that result will be somewhat trickier to use within a print statement :)
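One workaround, sketched here with made-up numbers and Python 3 syntax, is to materialize the generator as a tuple so it can be splatted straight into a format call:

```python
pairs = [(1, 10), (2, 20)]
# tuple() forces the lazy generator so it can be reused/unpacked.
totals = tuple(sum(counts) for counts in zip(*pairs))
print('word\t{0}\t{1}'.format(*totals))  # word	3	30
```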
import sys
from collections import defaultdict

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    counts = defaultdict(lambda: [0, 0])
    for word, count1, count2 in data:
        values = counts[word]
        values[0] += int(count1)
        values[1] += int(count2)
    for word, (count1, count2) in counts.iteritems():
        print('{0}\t{1}\t{2}'.format(word, count1, count2))
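Here is a runnable sketch of that defaultdict approach, with io.StringIO standing in for sys.stdin, made-up input lines, and Python 3 syntax:

```python
import io
from collections import defaultdict

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 2)

# Simulated reducer input: tab-separated (word, count1, count2) lines.
stdin = io.StringIO("cat\t1\t2\ncat\t3\t4\ndog\t5\t6\n")

counts = defaultdict(lambda: [0, 0])  # word -> [sum1, sum2]
for word, count1, count2 in read_mapper_output(stdin):
    counts[word][0] += int(count1)
    counts[word][1] += int(count2)

for word, (count1, count2) in sorted(counts.items()):
    print('{0}\t{1}\t{2}'.format(word, count1, count2))
```

Note that unlike the groupby version, this accumulates everything in a dict, so it does not rely on the input being sorted by word.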