MapReduce WordCount with proper nouns
I'm trying to write a MapReduce WordCount that takes a large article and counts proper nouns. Here are the requirements:
It looks like a typical WordCount MapReduce job, but I couldn't get it to work. How do I get rid of all the punctuation marks? What's the right way to construct the mapper and reducer?
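import sys
import re

for line in sys.stdin:
    # Each input line is "article_id<TAB>text".
    article_id, text = line.strip().split('\t', 1)
    # Replace every non-word character with a space, then split into tokens.
    text = re.sub(r'\W', ' ', text).split(' ')
    for word in text:
        if len(word) >= 2 and len(word) < 7:
            # Key is the word's lowercased letters in sorted order.
            key = "".join(sorted(word.lower()))
            print("{}\t{}\t{}".format(key, word.lower(), 1))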
If you are only looking for words, and since you have already imported re, one solution is to use re.compile:
re.compile(r'\w+').findall(text)
This removes all the punctuation in the string, keeping only letters, digits, and underscores.
If you take the string below:
text = "Looks like a typical WordCount mapreduce, but I couldn't do this. How to get rid of all the punctuation marks"
you obtain:
liste = ['Looks', 'like', 'a', 'typical', 'WordCount', 'mapreduce', 'but', 'I', 'couldn', 't', 'do', 'this', 'How', 'to', 'get', 'rid', 'of', 'all', 'the', 'punctuation', 'marks']
You can then run your for loop over this list in the same way.
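As for the mapper and reducer themselves, here is a minimal sketch of how the two halves could fit together, not a definitive implementation. Since the exact requirements are not shown, it assumes a word counts as a proper noun when it is a capital letter followed by one or more lowercase letters; that rule, and the names `looks_like_proper_noun`, `run_mapper` and `run_reducer`, are placeholders of mine. Note that this rule also matches ordinary words at the start of a sentence (such as "Looks" or "How"), so it would need tightening to match the real requirements.

```python
import re
import sys
from itertools import groupby

# Same tokenisation as re.compile(r'\w+').findall(text) above.
WORD_RE = re.compile(r"\w+")


def looks_like_proper_noun(word):
    # ASSUMPTION: a capital letter followed by one or more lowercase letters.
    # Replace this with the rule from your actual requirements.
    return word.isalpha() and word[0].isupper() and word[1:].islower()


def run_mapper(lines, out=sys.stdout):
    """Emit "word<TAB>1" for every token that passes the heuristic."""
    for line in lines:
        # Input is assumed to be "article_id<TAB>text"; a line without a
        # tab is treated as plain text.
        text = line.rstrip("\n").split("\t", 1)[-1]
        for word in WORD_RE.findall(text):
            if looks_like_proper_noun(word):
                out.write("{}\t1\n".format(word))


def run_reducer(lines, out=sys.stdout):
    """Sum the counts of consecutive identical keys.

    The input must already be sorted by key, otherwise the same word
    is reported more than once.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        out.write("{}\t{}\n".format(word, total))
```

In a Hadoop Streaming job you would put each function in its own script and call it with `sys.stdin`; the framework sorts the mapper output by key before it reaches the reducer. To try it locally you can reproduce that step with `sort`, for example `cat article.txt | python mapper.py | sort | python reducer.py` (the file names here are only examples).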