
Modify a text file read by Spark

I am trying to count words in a few text files on a Hadoop cluster using Spark. I manage to get the word count, but I also want to make some further modifications, such as ignoring numbers or converting all words to lower case. I can't iterate over the RDD data normally. I've tried using collect(), but the map function does not accept a list as an argument. I've also tried to apply regex logic directly to the RDD's "filter" function, but had no success. This is the code I've come up with so far; it works without the parts I've commented out.

from pyspark import SparkConf, SparkContext
import re

conf = SparkConf().setAppName("Word count")
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")

# Read every text file and split each line into words
text = sc.textFile("/data/book/*.txt") \
       .flatMap(lambda line: line.split())

# The commented-out attempt at stripping digits, which does not work
#handledText = text.map(lambda s: s.replace("\d", "", text))

# Count occurrences of each word and print them in descending order
counts = text.map(lambda word: (word, 1)) \
         .groupByKey() \
         .map(lambda p: (p[0], sum(p[1])))
res = counts.takeOrdered(text.count(), key=lambda p: -p[1])
print(res)

text.map(lambda s: s.replace("\\d", "", text))

You are confusing Python's built-in map() function with Spark's RDD.map()... No, the text parameter is not valid there.

Try this:

def lower_no_digit(word):
    # strip digits with a regex, then lowercase
    return re.sub(r'\d+', '', word).lower()

counts = text.map(lower_no_digit) \
             .filter(lambda w: len(w) > 0) \
             .map(lambda word: (word, 1))

This maps the function over the words and filters out the empty ones before applying (word, 1).

Aside: doing the same in Spark SQL is somewhat simpler and doesn't require manually producing the (word, 1) pairs.
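For illustration, here is a minimal sketch of that approach with the DataFrame API; the file path comes from the question, while the SparkSession setup and column names are assumptions made for the example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, lower, regexp_replace, col

spark = SparkSession.builder.appName("Word count").getOrCreate()

# Read the files as lines, split into words, strip digits, lowercase, drop empties
words = (spark.read.text("/data/book/*.txt")
         .select(explode(split(col("value"), r"\s+")).alias("word"))
         .select(lower(regexp_replace(col("word"), r"\d+", "")).alias("word"))
         .filter(col("word") != ""))

# groupBy/count replaces the manual (word, 1) pairs and the sum
counts = words.groupBy("word").count().orderBy(col("count").desc())
counts.show()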

I've tried using collect()

Do not do map(lambda x: ..., df.collect()). That will bring all the data back to the local Spark driver and defeats the purpose of running a distributed processing framework.
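As a rough sketch of that point, reusing the text RDD from the question:

# Anti-pattern: collect() pulls every word to the driver, then maps locally
local_pairs = map(lambda w: (w, 1), text.collect())

# Keep the transformation on the RDD so it runs distributed across the cluster
pairs = text.map(lambda w: (w, 1))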
