
Counting Words while including special characters and disregarding capitalization in PySpark?

I'm working on a small project to understand PySpark, and I'm trying to get PySpark to do the following actions on the words in a text file: it should "ignore" any changes in capitalization of the words (i.e., "While" vs. "while"), and it should "ignore" any additional characters that might be on the end of the words (i.e., "orange" vs. "orange," vs. "orange." vs. "orange?") and count them all as the same word.

I am fairly certain some kind of lambda function or regular expression is required, but I don't know how to generalize it enough that I can drop any sort of text file (like a book) in and have it spit back the correct analysis.
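For illustration, the kind of per-word cleanup I have in mind (a rough, untested sketch using the standard string module; clean is just a name I made up) would be something like:

import string

def clean(word):
    # Lower-case the word and strip trailing punctuation,
    # so "While" and "while" match, and "orange?" becomes "orange".
    return word.lower().rstrip(string.punctuation)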

Here's my code so far:

from pyspark import SparkContext, SparkConf

# In the pyspark shell, sc already exists; in a standalone script it
# has to be created first.
conf = SparkConf().setAppName("wordCount")
sc = SparkContext.getOrCreate(conf)

lines = sc.textFile("/home/user/YOURFILEHERE.txt")
words = lines.flatMap(lambda line: line.split(" "))
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.collect()

The last thing I need to do is a frequency analysis of the words (i.e., the word "While" shows up 80% of the time), but I'm fairly certain I know how to do that and am currently adding it to what I have now; I'm just having so many issues with the capitalization and the special characters.

Any help on this issue, even just guidance, would be great. Thank you guys!

Just replace the input with your own text file; the key is the function word_munge.

import string
import re

def word_munge(single_word):
    # Lower-case the word and strip every punctuation character.
    lower_case_word = single_word.lower()
    return re.sub(f"[{re.escape(string.punctuation)}]", "", lower_case_word)

input_string = "While orange, while orange while orange."
input_rdd = sc.parallelize([input_string])
words = input_rdd.flatMap(lambda line: line.split(" "))

# Normalize each word, then count occurrences of each normalized form.
(words
 .map(word_munge)
 .map(lambda word: (word, 1))
 .reduceByKey(lambda a, b: a + b)
).take(2)
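
For the frequency analysis mentioned in the question, a minimal follow-up sketch (assuming the same SparkContext sc and the word_munge function above; counts, total, and frequencies are illustrative names):

counts = (words
          .map(word_munge)
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

# Total number of words, used to turn each count into a relative frequency.
total = counts.map(lambda kv: kv[1]).sum()
frequencies = counts.mapValues(lambda c: c / total)
frequencies.collect()  # e.g. [("while", 0.5), ("orange", 0.5)]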
