Hadoop MapReduce job on files containing HTML tags

Question

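I have a bunch of large HTML files, and I want to run a Hadoop MapReduce job on them to find the most frequently used words. I wrote both the mapper and the reducer in Python and run them with Hadoop streaming.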

Here is my mapper:

#!/usr/bin/env python

import sys
import re
import string

def remove_html_tags(in_text):
    '''
    Remove any HTML tags that are found.
    '''
    global flag
    in_text=in_text.lstrip()
    in_text=in_text.rstrip()
    in_text=in_text+"\n"

    if flag==True: 
        in_text="<"+in_text
        flag=False
    if re.search('^<',in_text)!=None and re.search('(>\n+)$', in_text)==None: 
        in_text=in_text+">"
        flag=True
    p = re.compile(r'<[^<]*?>')
    in_text=p.sub('', in_text)
    return in_text

# input comes from STDIN (standard input)
global flag
flag=False
for line in sys.stdin:
    # remove leading and trailing whitespace, set to lowercase and remove HTML tags
    line = line.strip().lower()
    line = remove_html_tags(line)
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
       # write the results to STDOUT (standard output);
       # what we output here will be the input for the
       # Reduce step, i.e. the input for reducer.py
       #
       # tab-delimited; the trivial word count is 1
       if word =='': continue
       for c in string.punctuation:
           word= word.replace(c,'')

       print '%s\t%s' % (word, 1)

Here is my reducer:

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        pass

sorted_word2count = sorted(word2count.iteritems(), 
key=lambda(k,v):(v,k),reverse=True)

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s'% (word, count)

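Whenever I pipe in a small sample string such as "hello world hello hello world ...", I get the proper ranked list as output. However, when I take a small HTML file and use cat to pipe it into my mapper, I get the following error (input2 contains some HTML code):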

rohanbk@hadoop:~$ cat input2 | /home/rohanbk/mapper.py | sort | /home/rohanbk/reducer.py
Traceback (most recent call last):
  File "/home/rohanbk/reducer.py", line 15, in <module>
    word, count = line.split('\t', 1)
ValueError: need more than 1 value to unpack

Can anyone explain why I am getting this? Also, what is a good way to debug a MapReduce job program?

Answer 1

You can reproduce the error with just:

echo "hello - world" | ./mapper.py  | sort | ./reducer.py
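A minimal sketch of why that input fails, assuming the mapper and reducer behave as posted in the question. The token "-" loses its only character, the mapper emits a record with an empty key, and the reducer's strip() then removes the leading tab, so there is nothing left to split on. The variable names mapper_output and fields are illustrative only.

```python
import string

# Stand-in for the mapper's inner loop, applied to the single token "-".
word = "-"
for c in string.punctuation:
    word = word.replace(c, "")

# word is now the empty string, so the emitted record starts with a tab.
mapper_output = "%s\t%s" % (word, 1)

# The reducer calls strip() before splitting; strip() removes the leading
# tab along with any trailing newline, so only one field remains.
fields = mapper_output.strip().split("\t", 1)

print(repr(mapper_output), fields)
```

With a single element in fields, the statement `word, count = line.split('\t', 1)` in the reducer cannot unpack two values, which is the ValueError in the traceback.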

The problem is here:

if word == '': continue
for c in string.punctuation:
    word = word.replace(c, '')

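If word is a single punctuation character after splitting (as with the input above), the replacement turns it into an empty string. So just move the empty-string check to after the replacement.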
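The reordering could be sketched as follows. This is only an illustration, not the asker's full mapper: the helper name clean_words is hypothetical, the HTML tag removal step is left out, and the code is written so that it runs under both Python 2 and Python 3.

```python
import string
import sys


def clean_words(line):
    """Yield lowercase words with punctuation removed, skipping empty results."""
    for word in line.strip().lower().split():
        for c in string.punctuation:
            word = word.replace(c, '')
        # Check for emptiness only after punctuation has been stripped,
        # so tokens such as "-" never reach the output.
        if word == '':
            continue
        yield word


if __name__ == '__main__':
    for line in sys.stdin:
        for word in clean_words(line):
            # tab-delimited; the trivial word count is 1
            print('%s\t%s' % (word, 1))
```

For the input "hello - world", clean_words yields only "hello" and "world", so the reducer never sees a record with an empty key.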

Answer 1 was accepted on 2009-12-03 21:53:22.
