Counting occurrences of the same word in PySpark

Question

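from pyspark import SparkContext

sc = SparkContext("local", "first app")
# Raw string so the backslash in the Windows path is not read as an escape
text = sc.textFile(r"C:\data.txt")
# Lowercase each line and split it on single spaces
words = text.map(lambda line: str(line)).flatMap(lambda x: x.lower().split(" "))

print(words.top(100))
total_words = words.count()
print(total_words)
# Pair each word with 1, then sum the counts per word
wordCount = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
print(wordCount.top(20))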

Input: mahi, Mahi, mAhi, maHi, mahI, MAHI, MAhi, MAHi, straw, Straw, STRAW, berry, Berry
Output: [('straw,', 3), ('mahi,', 8), ('berry,', 1), ('berry', 1)]
But the output should be [('straw,', 3), ('mahi,', 8), ('berry,', 2)]. I am new to PySpark. Can anyone help me find what is wrong with my code?

Answer 1

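The code splits only on spaces, so the comma stays attached to each word. After lowercasing, "berry," (with the trailing comma) and "Berry" (the last word, which has no comma and becomes "berry") are different strings and are counted separately. You can see this in the result: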

('berry,', 1)
('berry', 1)

Split on the comma as well:

text.map(lambda line: str(line)).flatMap(lambda x: x.lower().split(", "))
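Splitting on the exact string ", " still depends on every separator being one comma followed by one space. As a slightly more tolerant variant, here is a minimal sketch that splits on any run of commas and whitespace using a regular expression. The helper name tokenize is hypothetical, and the example uses plain Python with collections.Counter in place of Spark so that it runs without a cluster; it is meant to illustrate the tokenization only.

```python
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line and split it on runs of commas and/or whitespace.

    Hypothetical helper for illustration; empty tokens are dropped.
    """
    return [tok for tok in re.split(r"[,\s]+", line.lower()) if tok]


# Sample input taken from the question
line = "mahi, Mahi, mAhi, maHi, mahI, MAHI, MAhi, MAHi, straw, Straw, STRAW, berry, Berry"

# Counter stands in for map + reduceByKey in this local sketch
counts = Counter(tokenize(line))
print(counts.most_common())  # [('mahi', 8), ('straw', 3), ('berry', 2)]
```

In the PySpark job, the same function could presumably be passed straight to flatMap, for example text.flatMap(tokenize), leaving the map and reduceByKey steps from the question unchanged. Note that the keys then carry no trailing comma.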
