Map Reduce error in Python
Below is a map-reduce script in Python that counts the occurrences of the values in a particular column. The input file looks like this:
301838690844557314|#awkwarddate first date at quiznoes|334910568|gabriellarichh|20130213|awkwarddate|Point|40.456664|-74.265167
301838679280861185|RT @jimmyfallon: Ended my very first date by saying, "Take it easy." And then my dad drove me home. #awkwarddate|618516844|heyitsbrooke456|20130213|awkwarddate|NULL|NULL|NULL
301838678026768384|RT @jimmyfallon: Hashtag game! Tweet out a funny, weird, or embarrassing story about a date you've been on and tag w/ #awkwarddate. Could be on our show!|116973704|VegasPhotog|20130213|awkwarddate|NULL|NULL|NULL
The map-reduce scripts are as follows. The mapper code is:
import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split('|')

def main(separator='|'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        if len(words)==9:
            for word[5] in words:
                print '%s%s%d' % (word[5], separator, 1)

if __name__ == "__main__":
    main()
The reducer code is:
from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='|'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='|'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()
When I run the above scripts in Hadoop, I get the following error:
failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201302281754_0001_m_000000
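I want to get the count for the 6th column (awkwarddate). Any help would be appreciated. Thanks in advance.
Let's break it down a bit further, since you want to count the 6th column of each line.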
def to_std_out(line, sep='|'):
    words = line.split(sep)
    print '%s%s%d' % (words[5], sep, 1)
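A slightly more defensive variant of this helper, as a sketch: it strips the trailing newline and skips lines that do not have the expected nine fields, so that one malformed record does not raise an IndexError and fail the map task. The name `emit_hashtag` and the `expected_fields` and `out` parameters are illustrative assumptions, not part of the original scripts. It uses `write()` rather than the `print` statement so that it runs on both Python 2 and Python 3.

```python
import sys

def emit_hashtag(line, sep='|', expected_fields=9, out=sys.stdout):
    """Write '<column 6><sep>1' for a well-formed line; return True if written."""
    words = line.rstrip('\n').split(sep)
    if len(words) != expected_fields:
        # malformed record (for example, too few fields): skip it
        return False
    out.write('%s%s%d\n' % (words[5], sep, 1))
    return True
```

Passing a file-like object as `out` makes the helper easy to check locally before submitting the job.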
Now, in your script's entry point (the main() function in your case), the logic could be:
import sys

def main(sep='|'):
    # read stdin line by line; the loop ends at EOF, so no
    # explicit "while 1" is needed (calling readlines() inside
    # an endless loop would never terminate once input is exhausted)
    for line in sys.stdin:
        to_std_out(line, sep)

if __name__ == "__main__":
    main()
Hope this is what you were looking for :)
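To check the logic without a cluster, the map, sort, and reduce phases can be imitated in a single process. The sketch below is only a stand-in for what Hadoop Streaming does: it sorts the mapper output by key before reducing, which is what `groupby` relies on. The function names `map_lines`, `reduce_pairs`, and `run_local` are hypothetical, and the sample lines are synthetic. It also avoids the `for word[5] in words` loop from the question, which raises a NameError because `word` is never defined and is a likely cause of the failed map task.

```python
from itertools import groupby
from operator import itemgetter

def map_lines(lines, sep='|', expected_fields=9):
    """Mapper: yield (column 6, 1) for every well-formed line."""
    for line in lines:
        words = line.rstrip('\n').split(sep)
        if len(words) == expected_fields:
            yield words[5], 1

def reduce_pairs(pairs):
    """Reducer: sum counts per key; pairs must already be sorted by key."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)

def run_local(lines, sep='|'):
    """Imitate map -> sort -> reduce in one process."""
    shuffled = sorted(map_lines(lines, sep), key=itemgetter(0))
    return dict(reduce_pairs(shuffled))

# Synthetic sample input (not real tweets), in the same 9-field layout
sample = [
    "1|first tweet|10|alice|20130213|awkwarddate|Point|40.4|-74.2\n",
    "2|second tweet|11|bob|20130213|awkwarddate|NULL|NULL|NULL\n",
    "3|third tweet|12|carol|20130213|othertag|NULL|NULL|NULL\n",
    "malformed line without enough fields\n",
]
print(run_local(sample))
```

If the local run produces the expected counts, the same mapper and reducer logic can be moved back into the two stdin/stdout scripts for the streaming job.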