MapReduce WordCount with proper nouns

Question

I'm trying to write a MapReduce WordCount that takes a large article and counts proper nouns. Here are the requirements:

Starts with a capital letter and never appears in the text in lowercase
Has a length between 2 and 7 letters
Results are sorted in descending order

It looks like a typical WordCount MapReduce job, but I couldn't get it to work. How do I get rid of all the punctuation marks? What's the right way to construct the mapper and reducer?

import sys
import re

for line in sys.stdin:
    article_id, text = line.strip().split('\t', 1)
    text = re.sub('\W', ' ', text).split(' ')
    for word in text:
        if len(word) >= 2 and len(word) < 7:
            key = "".join(sorted(word.lower()))
            print("{}\t{}\t{}".format(key, word.lower(), 1))

Answer 1

If you only need the words, and since you have already imported re, one option is re.compile with findall:

re.compile(r'\w+').findall(text)

This strips all the punctuation from the string and keeps only word characters (letters, digits, and underscores).

If you take the string below:

text = "Looks like a typical WordCount mapreduce, but I couldn't do this. How to get rid of all the punctuation marks"

you get:

liste = ['Looks', 'like', 'a', 'typical', 'WordCount', 'mapreduce', 'but', 'I', 'couldn', 't', 'do', 'this', 'How', 'to', 'get', 'rid', 'of', 'all', 'the', 'punctuation', 'marks']

You can then run your for loop over this list in the same way.
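For the mapper and reducer themselves, here is a minimal sketch of one possible approach, not a definitive implementation. It rests on several assumptions: the function names map_line, reduce_pairs and top_proper_nouns are made up for illustration; "letters" is taken to mean ASCII letters, so the pattern is [A-Za-z]+ rather than \w+; a word is rejected if any occurrence starts with a lowercase letter; and "descending order" is read as descending by count. The mapper emits the lowercased word as the key together with a flag saying whether that occurrence was capitalized, so the reducer sees every spelling of a word in one group.

```python
import re
from itertools import groupby

# Assumption: "letters" means ASCII letters; digits, underscores and
# punctuation (including apostrophes) act as separators.
WORD_RE = re.compile(r"[A-Za-z]+")


def map_line(line):
    """Yield (lowercased word, capitalized flag) for each 2-7 letter word."""
    # Input format assumed from the question: "article_id<TAB>text".
    parts = line.rstrip("\n").split("\t", 1)
    text = parts[1] if len(parts) == 2 else parts[0]
    for word in WORD_RE.findall(text):
        if 2 <= len(word) <= 7:
            # Flag is 1 when this occurrence starts with a capital letter.
            yield word.lower(), int(word[0].isupper())


def reduce_pairs(pairs):
    """Yield (word, count) for words whose every occurrence was capitalized."""
    # pairs must arrive sorted by key, as they do between the map and
    # reduce phases of a streaming job.
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        flags = [flag for _, flag in group]
        if all(flags):
            # Assumption: report the word with only its first letter capitalized.
            yield key.capitalize(), len(flags)


def top_proper_nouns(lines):
    """Simulate map -> sort -> reduce locally, then sort by count, descending."""
    pairs = sorted(kv for line in lines for kv in map_line(line))
    counts = list(reduce_pairs(pairs))
    # Ties are broken alphabetically to keep the output deterministic.
    return sorted(counts, key=lambda wc: (-wc[1], wc[0]))
```

In a real streaming job, map_line and reduce_pairs would sit in two separate scripts that read sys.stdin and print tab-separated lines, and the final sort by count would need a second pass (or a local sort of the reducer output), since the framework only sorts by key. Note that words which appear only at the start of sentences, such as "How" in the sample string, also pass this filter; that follows from the requirements as stated.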
