Map Reduce error in Python

Question

Below is a map-reduce script in Python that counts the values in a particular column. The input file looks like this:

 301838690844557314|#awkwarddate first date at     quiznoes|334910568|gabriellarichh|20130213|awkwarddate|Point|40.456664|-74.265167
 301838679280861185|RT @jimmyfallon: Ended my very first date by saying, "Take it easy." And  then my dad drove me home.  #awkwarddate|618516844|heyitsbrooke456|20130213|awkwarddate|NULL|NULL|NULL
 301838678026768384|RT @jimmyfallon: Hashtag game! Tweet out a funny, weird, or embarrassing story about a date you've been on and tag w/ #awkwarddate. Could be on our show!|116973704|VegasPhotog|20130213|awkwarddate|NULL|NULL|NULL

The map-reduce scripts are as follows. The mapper code is:

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split('|')

def main(separator='|'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        if len(words)==9:
            for word[5] in words:
                print '%s%s%d' % (word[5], separator, 1)

if __name__ == "__main__":
    main()

The reducer code is:

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='|'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='|'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()

When I run the above scripts in Hadoop, I get the following error:

failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201302281754_0001_m_000000

I want to get the count for column 6 (awkwarddate). Any help would be appreciated. Thanks in advance.

Answer 1

Let's break this down a bit further, since you want to count the 6th column of each line.

def to_std_out(line, sep='|'):
    words = line.rstrip('\n').split(sep)
    # emit "<column 6><sep>1" only for well-formed 9-column records
    if len(words) == 9:
        print '%s%s%d' % (words[5], sep, 1)

Now, in your script's entry point (the main() function in your case), the logic could be:

import sys

def main(sep='|'):
    # read stdin line by line; the loop ends at end of input, so the
    # mapper exits once Hadoop has streamed the whole input split
    for line in sys.stdin:
        to_std_out(line, sep)
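
For context, the question's mapper most likely fails on the line `for word[5] in words:`. The name `word` is never defined, so Python raises a NameError on the first 9-column record, and Hadoop then reports the map task as failed. A cheap way to catch this kind of problem before submitting a job is to run the map, sort, and reduce steps in a single local process. The following is only a sketch of that idea: `map_line`, `reduce_pairs`, and `run_local` are hypothetical helper names, the sample records are abbreviated versions of the input shown in the question, and the code is written to run on both Python 2 and Python 3.

```python
from itertools import groupby
from operator import itemgetter

SEP = '|'

def map_line(line, sep=SEP):
    """Return (column 6, 1) for a well-formed 9-column record, else None."""
    words = line.rstrip('\n').split(sep)
    if len(words) != 9:
        return None
    return (words[5], 1)

def reduce_pairs(pairs):
    """Sum the counts per key; pairs must already be sorted by key."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)

def run_local(lines):
    """Simulate map -> sort (the shuffle step) -> reduce in one process."""
    mapped = [pair for pair in map(map_line, lines) if pair is not None]
    return dict(reduce_pairs(sorted(mapped)))

# Abbreviated sample records (assumption: same 9-column layout as the question)
sample = [
    '301838690844557314|#awkwarddate first date|334910568|gabriellarichh|20130213|awkwarddate|Point|40.456664|-74.265167\n',
    '301838679280861185|RT #awkwarddate|618516844|heyitsbrooke456|20130213|awkwarddate|NULL|NULL|NULL\n',
    'malformed|line\n',  # skipped: it does not have 9 columns
]

print(run_local(sample))  # {'awkwarddate': 2}
```

The sort before `groupby` matters: `groupby` only merges adjacent equal keys, which is why the reducer in the question relies on Hadoop delivering the mapper output sorted by key.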

Hope this is what you were looking for :)

Solution 1
0 2013-02-28 15:15:57
