
Filter RDD of key/value pairs based on value equality in PySpark

Given

[('Project', 10),
 ("Alice's", 11),
 ('in', 401),
 ('Wonderland,', 3),
 ('Lewis', 10),
 ('Carroll', 4),
 ('', 2238),
 ('is', 10),
 ('use', 24),
 ('of', 596),
 ('anyone', 4),
 ('anywhere', 3),

in which the value of the paired RDD is the word frequency.

I would only like to return the words that appear 10 times. Expected output:

 [('Project', 10),
  ('Lewis', 10),
  ('is', 10)]

I tried using

rdd.filter(lambda words: (words,10)).collect()

But it still shows the entire list. How should I go about this?

Your lambda function is wrong; it should be

rdd.filter(lambda words: words[1] == 10).collect()

For example,

my_rdd = sc.parallelize([('Project', 10), ("Alice's", 11), ('in', 401), ('Wonderland,', 3), ('Lewis', 10), ('is', 10)])

>>> my_rdd.filter(lambda w: w[1] == 10).collect()
[('Project', 10), ('Lewis', 10), ('is', 10)]
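To see why the original attempt returned the entire list: the lambda `lambda words: (words, 10)` builds a tuple, and any non-empty tuple is truthy in Python, so `filter` keeps every record. The same behaviour can be demonstrated with Python's built-in `filter` (used here as a stand-in for a live SparkContext, since RDD `filter` applies the same truthiness rule per element):

```python
pairs = [('Project', 10), ("Alice's", 11), ('in', 401),
         ('Wonderland,', 3), ('Lewis', 10), ('is', 10)]

# Wrong: the lambda returns the tuple (words, 10), which is always
# truthy, so every pair survives the filter.
kept_all = list(filter(lambda words: (words, 10), pairs))
# kept_all == pairs

# Right: compare the value (index 1 of each pair) against 10.
kept_ten = list(filter(lambda words: words[1] == 10, pairs))
# kept_ten == [('Project', 10), ('Lewis', 10), ('is', 10)]
```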
