
MapReduce WordCount with proper nouns

I'm trying to write a MapReduce WordCount that takes a large article and counts proper nouns. Here are the requirements:

  1. Starts with a capital letter and never appears in the text starting with a lowercase letter
  2. Has a length between 2 and 7 letters
  3. Sort the results in descending order

It looks like a typical WordCount MapReduce job, but I couldn't get it to work. How do I get rid of all the punctuation marks? What's the right way to construct the mapper and reducer?

import sys
import re

for line in sys.stdin:
    article_id, text = line.strip().split('\t', 1)
    text = re.sub('\W', ' ', text).split(' ')
    for word in text:
        if len(word) >= 2 and len(word) < 7:
            key = "".join(sorted(word.lower()))
            print("{}\t{}\t{}".format(key, word.lower(), 1))

If you are only looking for words, then since you have already imported re, one solution is to use re.compile:

re.compile(r'\w+').findall(text)

This way you remove all the punctuation in the string, keeping only letters, digits, and underscores (the characters that \w matches).

If you take the string below:

text = "Looks like a typical WordCount mapreduce, but I couldn't do this. How to get rid of all the punctuation marks"

you quickly obtain:

liste = ['Looks', 'like', 'a', 'typical', 'WordCount', 'mapreduce', 'but', 'I', 'couldn', 't', 'do', 'this', 'How', 'to', 'get', 'rid', 'of', 'all', 'the', 'punctuation', 'marks']

You can then run your for loop over this list in the same way.
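
For the mapper and reducer themselves, here is a rough sketch of one possible approach, not a definitive implementation. It rests on several assumptions that are not in the question: words are runs of ASCII letters, "between 2 and 7" is inclusive, "starts with a capital letter" only checks the first character, and each input line has the form article_id<TAB>text. The names map_line and reduce_pairs are made up for illustration. The mapper emits the lowercased word together with a flag saying whether that occurrence was capitalized; the reducer counts capitalized occurrences and drops any word that was ever seen starting with a lowercase letter.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[A-Za-z]+")  # assumption: ASCII letters only


def map_line(line):
    """Yield (lowercased word, starts_with_capital) for words of 2 to 7 letters."""
    # Assumption: the line is "article_id<TAB>text"; the id is discarded.
    _, _, text = line.rstrip("\n").partition("\t")
    for word in WORD_RE.findall(text):
        if 2 <= len(word) <= 7:
            yield word.lower(), word[0].isupper()


def reduce_pairs(pairs):
    """Count capitalized occurrences, dropping words ever seen in lowercase."""
    counts = defaultdict(int)
    seen_lower = set()
    for key, capitalized in pairs:
        if capitalized:
            counts[key] += 1
        else:
            seen_lower.add(key)
    proper = [(k, n) for k, n in counts.items() if k not in seen_lower]
    # Descending by count, ties broken alphabetically.
    return sorted(proper, key=lambda kv: (-kv[1], kv[0]))


lines = ["1\tAlice met Bob. The cat saw Alice, but the dog did not."]
pairs = [p for line in lines for p in map_line(line)]
print(reduce_pairs(pairs))  # [('alice', 2), ('bob', 1)]
```

This version runs both steps in memory so it can be tried locally. In an actual streaming job the mapper would print tab-separated lines and the reducer would read them from sys.stdin grouped by key, and the final descending sort would probably need a separate sorting step, since a reducer only sees its own share of the keys.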
