
Hadoop MapReduce job on file containing HTML tags

I have a bunch of large HTML files and I want to run a Hadoop MapReduce job on them to find the most frequently used words. I wrote both my mapper and reducer in Python and used Hadoop streaming to run them.
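For reference, a typical Hadoop Streaming invocation for a Python mapper/reducer looks roughly like this (the streaming jar location and the HDFS input/output paths are placeholders, not the exact command I used):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input /user/rohanbk/input \
    -output /user/rohanbk/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file /home/rohanbk/mapper.py \
    -file /home/rohanbk/reducer.py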

Here is my mapper:

#!/usr/bin/env python

import sys
import re
import string

def remove_html_tags(in_text):
    '''
    Remove any HTML tags that are found.
    '''
    global flag
    in_text = in_text.lstrip()
    in_text = in_text.rstrip()
    in_text = in_text + "\n"

    # if the previous line ended inside an unclosed tag, re-open it here
    if flag == True:
        in_text = "<" + in_text
        flag = False
    # if this line opens a tag that is not closed, close it and remember
    # that the tag continues on the next line
    if re.search('^<', in_text) != None and re.search('(>\n+)$', in_text) == None:
        in_text = in_text + ">"
        flag = True
    p = re.compile(r'<[^<]*?>')
    in_text = p.sub('', in_text)
    return in_text

# input comes from STDIN (standard input)
global flag
flag=False
for line in sys.stdin:
    # remove leading and trailing whitespace, set to lowercase and remove HTMl tags
    line = line.strip().lower()
    line = remove_html_tags(line)
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        if word == '': continue
        for c in string.punctuation:
            word = word.replace(c, '')

        print '%s\t%s' % (word, 1)

Here is my reducer:

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        pass

sorted_word2count = sorted(word2count.iteritems(),
                           key=lambda (k, v): (v, k), reverse=True)

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s'% (word, count)

Whenever I just pipe a small sample string like 'hello world hello hello world ...' into the scripts, I get the proper output of a ranked list. However, when I try to use a small HTML file, and use cat to pipe the HTML into my mapper, I get the following error (input2 contains some HTML code):

rohanbk@hadoop:~$ cat input2 | /home/rohanbk/mapper.py | sort | /home/rohanbk/reducer.py
Traceback (most recent call last):
  File "/home/rohanbk/reducer.py", line 15, in <module>
    word, count = line.split('\t', 1)
ValueError: need more than 1 value to unpack

Can anyone explain why I'm getting this? Also, what is a good way to debug a MapReduce job program?

You can reproduce the bug even with just:

echo "hello - world" | ./mapper.py  | sort | ./reducer.py

The issue is here:

if word == '': continue
for c in string.punctuation:
    word = word.replace(c, '')

If word is a single punctuation mark, as is the case for the '-' token in the above input after it is split, then it is converted to an empty string. The mapper then prints a line consisting of only a tab and a 1, which the reducer's split cannot unpack into two values. So, just move the check for an empty string to after the replacement, as in the sketch below.
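A minimal sketch of the reordered inner loop (the same logic as the original mapper, with the empty-string check moved after the punctuation stripping):

    for word in words:
        # strip punctuation first ...
        for c in string.punctuation:
            word = word.replace(c, '')
        # ... then skip anything that is now empty, so no bare "\t1" lines are emitted
        if word == '':
            continue
        print '%s\t%s' % (word, 1)

With that change, the echo test above prints the expected counts instead of raising a ValueError in the reducer.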
