Filter RDD of key/value pairs based on value equality in PySpark
Given
[('Project', 10),
("Alice's", 11),
('in', 401),
('Wonderland,', 3),
('Lewis', 10),
('Carroll', 4),
('', 2238),
('is', 10),
('use', 24),
('of', 596),
('anyone', 4),
('anywhere', 3)]
in which the value of the paired RDD is the word frequency.
I would only like to return the words that appear 10 times. Expected output:
[('Project', 10),
('Lewis', 10),
('is', 10)]
I tried using
rdd.filter(lambda words: (words,10)).collect()
But it still shows the entire list. How should I go about this?
Your lambda function is wrong; it should be
rdd.filter(lambda words: words[1] == 10).collect()
For example,
>>> my_rdd = sc.parallelize([('Project', 10), ("Alice's", 11), ('in', 401), ('Wonderland,', 3), ('Lewis', 10), ('is', 10)])
>>> my_rdd.filter(lambda w: w[1] == 10).collect()
[('Project', 10), ('Lewis', 10), ('is', 10)]
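To see why the original attempt returned everything: `filter` keeps each element for which the lambda returns a truthy value, and `lambda words: (words, 10)` builds a non-empty tuple, which is always truthy in Python. A minimal sketch with plain Python's built-in `filter` (no Spark needed; the list here is a small sample of the data above) demonstrates both the bug and the fix:

```python
pairs = [('Project', 10), ("Alice's", 11), ('in', 401)]

# Buggy predicate: returns a non-empty tuple, which is always truthy,
# so filter() keeps every pair.
kept_all = list(filter(lambda words: (words, 10), pairs))
assert kept_all == pairs  # nothing was filtered out

# Corrected predicate: compare the value (index 1) of each pair to 10.
kept_ten = list(filter(lambda words: words[1] == 10, pairs))
assert kept_ten == [('Project', 10)]
```

PySpark's `RDD.filter` applies the same truthiness rule to its predicate, which is why the fix is identical there.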