How to find the frequency of a word in a line, in a text file - PySpark
I have managed to make an RDD (in PySpark) that looks like this:
[('This', (0, 1)), ('is', (0, 1)), ('the', (0, 1)), ('100th', (0, 1))...]
I used the following code:
RDD=sc.textFile(_filepath_)
test1 = RDD.zipWithIndex().flatMap(lambda x: ((i,(x[1],1)) for i in x[0].split(" ")))
In effect, each element is (word, (line, freq)). The words above come from the first line of the file (hence the 0), and freq is 1 for every word in the text. I want it to count the number of times each word appears on each specific line, across the entire RDD. I thought of .reduceByKey(lambda x, y: x + y), but when I execute an action such as .take(5) after that, it freezes (Ubuntu terminal in Oracle VirtualBox with plenty of RAM/disk space, if it helps).
Completely stuck on this stage.
What I need is basically this: if the word 'This' is in the first line and appears there 7 times, then the result will be
[('This', (0, 7)), ...]
Solved it, but the answer may not be optimal.
RDD = sc.textFile(_filepath_)
test1 = RDD.zipWithIndex().flatMap(lambda x: ((i,(x[1],1)) for i in x[0].split(" ")))
test2 = test1.map(lambda x: ((x[0], x[1][0]), x[1][1])).reduceByKey(lambda x, y: x + y)
Result_RDD = test2.map(lambda x: (x[0][0], (x[0][1], x[1])))
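A note on why the re-keying step matters: with the word alone as the key, the values being reduced are (line, freq) tuples, and + on Python tuples concatenates rather than adds. The plain-Python sketch below (no Spark needed; the sample values are made up for illustration) shows the difference. Whether this also explains the freeze is not certain, but it does mean the original reduceByKey could not have produced per-line counts.

```python
from functools import reduce

# Values as they would appear for one word in the original RDD: (line, freq) pairs.
values = [(0, 1), (0, 1), (3, 1)]

# What reduceByKey(lambda x, y: x + y) does to tuple values: concatenation.
concatenated = reduce(lambda x, y: x + y, values)
print(concatenated)  # (0, 1, 0, 1, 3, 1)

# Keyed by (word, line) instead, the values are plain integers, so + sums them.
rekeyed = {}
for line, freq in values:
    key = ("This", line)
    rekeyed[key] = rekeyed.get(key, 0) + freq
print(rekeyed)  # {('This', 0): 2, ('This', 3): 1}
```

For a frequent word, the concatenated tuple keeps growing with every occurrence, whereas the re-keyed version keeps a single integer per (word, line) pair.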
Given you have a list with all your lines, something like this should work for you:
# lines = [list of all your lines]
wordInLineFrequencyList = []
for i in range(len(lines)):
    words = lines[i].split()
    currentLineWords = []
    for word in words:
        if word not in currentLineWords:
            currentLineWords.append(word)
            # Count whole tokens; str.count would also match substrings ('is' inside 'This').
            wordInLineFrequencyList.append((word, (i, words.count(word))))
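A possible variant of the same idea, sketched with collections.Counter, which tallies whole whitespace-separated tokens in one pass per line. The function name word_line_frequencies and the sample lines are made up for illustration, and this runs on a local Python list, not on an RDD.

```python
from collections import Counter

def word_line_frequencies(lines):
    """Return [(word, (line_index, count)), ...] for each distinct word on each line."""
    result = []
    for i, line in enumerate(lines):
        # Counter keeps first-seen order (Python 3.7+) and counts whole tokens only.
        for word, count in Counter(line.split()).items():
            result.append((word, (i, count)))
    return result

sample = ["This is This", "is it"]
print(word_line_frequencies(sample))
# [('This', (0, 2)), ('is', (0, 1)), ('is', (1, 1)), ('it', (1, 1))]
```

Because the counting is done on tokens, 'is' is not counted inside 'This', which a substring count on the raw line would do.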