
MapReduce word count on a specific column in a txt file

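I have mapper and reducer code that finds the most frequent word in a text file. I want to output the most common word(s) in one specific column of my text file instead. The column is named "genres", and each value holds one or more comma-separated strings. Here is a sample of my txt file: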

tconst  averageRating   numVotes    titleType   primaryTitle    startYear   genres
tt0002020   5.2 85  short   Queen Elizabeth 1912    Biography,Drama,History
tt0002026   4   7   movie   Anny - Story of a Prostitute    1912    Drama,Romance
tt0002029   6.1 33  short   Poor Jenny  1912    Short
tt0002031   4.6 8   movie   As You Like It  1912    \N
tt0002033   5.6 26  short   Asesinato y entierro de Don José Canalejas  1912    Short
tt0002034   4.9 17  short   At Coney Island 1912    Comedy,Short
tt0002041   3.9 14  short   The Baby and the Stork  1912    Crime,Drama,Short
tt0002045   4.2 71  short   The Ball Player and the Bandit  1912    Drama,Romance,Short

    # Mapper code
    import sys

    def read_input(file):
        for line in file:
            # split the line into words
            yield line.split()

    def main(separator='\t'):
        # input comes from STDIN (standard input)
        data = read_input(sys.stdin)
        for words in data:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial word count is 1
            for word in words:
                print('%s%s%d' % (word, separator, 1))

    if __name__ == "__main__":
        main()



    # Reducer code
    from itertools import groupby
    from operator import itemgetter
    import sys

    current_word = None
    current_count = 0
    word = None
    max_count = 0
    max_word = None

    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()

        # parse the input we got from mapper.py
        word, count = line.split('\t', 1)

        # convert count (currently a string) to int
        try:
            count = int(count)
        except ValueError:
            # count was not a number, so silently
            # ignore/discard this line
            continue

        # this IF-switch only works because Hadoop sorts map output
        # by key (here: word) before it is passed to the reducer
        if current_word == word:
            current_count += count
        else:
            # the word changed: check whether the word just finished
            # has a higher count than the current maximum
            if current_count > max_count:
                max_count = current_count
                max_word = current_word
            current_count = count
            current_word = word

    # do not forget to check the last word
    if current_count > max_count:
        max_count = current_count
        max_word = current_word

    print('%s\t%s' % (max_word, max_count))

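Could you guide me on how to change this code so that it prints the most common word in the "genres" column? I would also like the output to include the count of every word. Let me know if I need to provide anything else.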

Try indexing into the split line variable. Use the index of whichever column you want the most frequent word for.
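As a rough illustration of that idea, here is a minimal sketch of both steps. It rests on several assumptions that are not confirmed by the question: the file is tab-separated (titles such as "Queen Elizabeth" contain spaces, so splitting on any whitespace would shift the columns), "genres" is the seventh column (index 6), the header row starts with `tconst`, and `\N` marks a missing value. The names `GENRES_INDEX`, `map_genres`, `parse_pairs`, `reduce_counts`, `run_mapper` and `run_reducer` are made up for this example.

```python
import sys
from itertools import groupby

GENRES_INDEX = 6  # assumption: "genres" is the 7th tab-separated column


def map_genres(lines, index=GENRES_INDEX):
    """Yield (genre, 1) for every genre listed in the chosen column."""
    for line in lines:
        fields = line.rstrip('\r\n').split('\t')
        # skip short rows and the header row (assumed to start with "tconst")
        if len(fields) <= index or fields[0] == 'tconst':
            continue
        # skip the missing-value marker (assumed to be a literal \N)
        if fields[index] == r'\N':
            continue
        for genre in fields[index].split(','):
            genre = genre.strip()
            if genre:
                yield genre, 1


def parse_pairs(lines):
    """Turn 'key<TAB>count' lines into (key, int) pairs, dropping bad lines."""
    for line in lines:
        key, sep, count = line.rstrip('\r\n').partition('\t')
        if sep and count.strip().isdigit():
            yield key, int(count)


def reduce_counts(pairs):
    """Sum the counts per key; pairs must already be sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    for genre, count in map_genres(stdin):
        stdout.write('%s\t%d\n' % (genre, count))


def run_reducer(stdin=sys.stdin, stdout=sys.stdout):
    best_key, best_count = None, 0
    for key, total in reduce_counts(parse_pairs(stdin)):
        # one line per genre with its total count
        stdout.write('%s\t%d\n' % (key, total))
        if total > best_count:
            best_key, best_count = key, total
    # final summary line; only a global maximum if a single reducer runs
    if best_key is not None:
        stdout.write('most_common\t%s\t%d\n' % (best_key, best_count))


if __name__ == '__main__':
    mode = sys.argv[1] if len(sys.argv) > 1 else ''
    if mode == 'map':
        run_mapper()
    elif mode == 'reduce':
        run_reducer()
```

Saved under a hypothetical name such as `genre_count.py`, the script can be checked locally with `cat data.txt | python genre_count.py map | sort | python genre_count.py reduce` before being handed to Hadoop Streaming as the mapper and reducer commands. With more than one reducer, each reducer would print its own `most_common` line, so the overall maximum would still need a final comparison.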

