MapReduce for word count on specific column in txt file
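I have mapper and reducer code that finds the most frequent word in a text file. I want to output the most frequent word(s) in one specific column of my text file. The column is named "genres", and each value in it holds several comma-separated strings. Here is a sample of my txt file: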
tconst averageRating numVotes titleType primaryTitle startYear genres
tt0002020 5.2 85 short Queen Elizabeth 1912 Biography,Drama,History
tt0002026 4 7 movie Anny - Story of a Prostitute 1912 Drama,Romance
tt0002029 6.1 33 short Poor Jenny 1912 Short
tt0002031 4.6 8 movie As You Like It 1912 \N
tt0002033 5.6 26 short Asesinato y entierro de Don José Canalejas 1912 Short
tt0002034 4.9 17 short At Coney Island 1912 Comedy,Short
tt0002041 3.9 14 short The Baby and the Stork 1912 Crime,Drama,Short
tt0002045 4.2 71 short The Ball Player and the Bandit 1912 Drama,Romance,Short
# Mapper code
import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print('%s%s%d' % (word, separator, 1))

if __name__ == "__main__":
    main()
# Reducer
from itertools import groupby
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None
max_count = 0
max_word = None

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        # check if new word greater
        if current_count > max_count:
            max_count = current_count
            max_word = current_word
        current_count = count
        current_word = word

# do not forget to check last word if needed!
if current_count > max_count:
    max_count = current_count
    max_word = current_word

print('%s\t%s' % (max_word, max_count))
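Could you guide me on how to change this code so that it prints the most frequent word in the "genres" column? I would also like the output to include the count for every word. Please let me know if I need to provide anything else.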
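Try indexing into the split line instead of looping over every word. Use the index of whichever column you want to find the most frequent word in.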
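As a rough sketch of that suggestion, the code below emits only the genres of each row and then totals them. It rests on assumptions taken from the sample rather than confirmed by the asker: "genres" is the last column, genre values contain no whitespace, and "\N" marks a missing value. The function names map_genres and reduce_counts, the single-file layout with a "map" or "reduce" argument, and the "most frequent:" output line are all illustrative choices, not part of the original code.

```python
import sys
from itertools import groupby


def map_genres(lines, separator='\t'):
    """Yield 'genre<separator>1' for each genre in the last column."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank line
        # Assumption: genres is the last column and has no spaces,
        # so the last whitespace-delimited token is the whole field.
        genres = fields[-1]
        if genres in ('genres', '\\N'):
            continue  # header row or missing value
        for genre in genres.split(','):
            yield '%s%s%d' % (genre, separator, 1)


def parse_pairs(lines, separator):
    """Turn 'word<separator>count' lines into (word, int) pairs."""
    for line in lines:
        word, found, count = line.strip().partition(separator)
        if not found:
            continue  # no separator on this line
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not a number; discard the line


def reduce_counts(lines, separator='\t'):
    """Yield a total per word, then one final line naming the maximum.

    Like the original reducer, this needs its input sorted by key.
    """
    max_word, max_count = None, 0
    pairs = parse_pairs(lines, separator)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(count for _, count in group)
        yield '%s%s%d' % (word, separator, total)
        if total > max_count:
            max_word, max_count = word, total
    if max_word is not None:
        # Hypothetical output format for the overall winner.
        yield 'most frequent: %s%s%d' % (max_word, separator, max_count)


if __name__ == "__main__" and sys.argv[1:2] in (['map'], ['reduce']):
    step = map_genres if sys.argv[1] == 'map' else reduce_counts
    for output_line in step(sys.stdin):
        print(output_line)
```

Saved under a hypothetical name such as genres.py, this can be checked locally with "cat input.txt | python genres.py map | sort | python genres.py reduce". The final "most frequent" line is only a global answer when all keys pass through a single reducer; with several reducers each one would report the maximum of its own share of the keys.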