PySpark - Finding Intersection of key value pairs
Could you please help me to achieve the following:
Consider an input that is a list of key-value pairs, where each key is a tuple and each value is a list. If the same key occurs twice in the input, the two values should be intersected. If a key has no matching pair, that key should be left out of the output.
Example:
>>> data = [((1, 2), [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]), ((1, 2), [1, 3, 4, 5, 6, 7, 10, 11])]
>>> rdd = sc.parallelize(data)
>>> rdd.reduceByKey(lambda x,y : list(set(x).intersection(y))).collect()
Output: [((1, 2), [3, 4, 5, 6, 7, 10, 11])]
Since the same key occurs twice, the values were intersected.
>>> data = [((1, 2), [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]), ((1, 3), [1, 3, 4, 5, 6, 7, 10, 11])]
>>> rdd = sc.parallelize(data)
>>> rdd.reduceByKey(lambda x,y : list(set(x).intersection(y))).collect()
Output that I get: [((1, 2), [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]), ((1, 3), [1, 3, 4, 5, 6, 7, 10, 11])]
Intended result: the output should be empty, since no key occurs more than once.
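The second result is consistent with how a pairwise reduction behaves: the combining function is only called when a key has at least two values, so a key seen once passes through unchanged. A minimal plain-Python sketch using functools.reduce (not Spark itself; the names `intersect`, `single` and `pair` are made up for illustration, and the result is sorted only to make the printed order stable) shows the same effect:

```python
from functools import reduce

def intersect(x, y):
    # Same idea as the lambda passed to reduceByKey above,
    # with sorted() added so the result order is deterministic.
    return sorted(set(x).intersection(y))

single = [[2, 3, 4]]            # a key that occurs once
pair = [[2, 3, 4], [3, 4, 5]]   # a key that occurs twice

# With one element, reduce returns it as-is and never calls intersect.
single_result = reduce(intersect, single)
# With two elements, intersect is called once.
pair_result = reduce(intersect, pair)

print(single_result)
print(pair_result)
```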
I implemented the logic as below:
datardd.map(lambda x:(x[0],(x[1],1))).reduceByKey(lambda x,y:(set(x[0]).intersection(y[0]),x[1]+y[1])).filter(lambda x:x[1][1]>1).map(lambda x:(x[0],list(x[1][0]))).collect()
Map a counter for each key alongside its existing list value, in the form
[(key, (value, counter))]:
ex: [((1, 2), ([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 1))]
Use reduceByKey to intersect the values and add up the counters
Filter and publish the values for which counter value >1 筛选并发布计数器值> 1的值
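
The three steps above can be checked without a Spark context. The following is a plain-Python sketch that mirrors them locally, under the assumption that grouping by key in a dict is an acceptable stand-in for what reduceByKey does; the function name `intersect_repeated_keys` and the variables `data_match` and `data_no_match` are hypothetical, and the lists are sorted only to make the output order stable:

```python
from collections import defaultdict
from functools import reduce

def intersect_repeated_keys(pairs):
    """Local stand-in for the map / reduceByKey / filter / map chain."""
    # Step 1: attach a counter of 1 to every value: (key, (set(value), 1)).
    tagged = [(key, (set(value), 1)) for key, value in pairs]

    # Group by key (Spark's reduceByKey performs this grouping itself).
    grouped = defaultdict(list)
    for key, item in tagged:
        grouped[key].append(item)

    # Step 2: intersect the values and add up the counters per key.
    def combine(x, y):
        return (x[0] & y[0], x[1] + y[1])

    reduced = {key: reduce(combine, items) for key, items in grouped.items()}

    # Step 3: keep only keys whose counter is greater than 1.
    return [(key, sorted(values))
            for key, (values, count) in reduced.items()
            if count > 1]

data_match = [((1, 2), [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]),
              ((1, 2), [1, 3, 4, 5, 6, 7, 10, 11])]
data_no_match = [((1, 2), [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]),
                 ((1, 3), [1, 3, 4, 5, 6, 7, 10, 11])]

print(intersect_repeated_keys(data_match))
print(intersect_repeated_keys(data_no_match))
```

The counter is what separates a key that occurred once from a key whose values were actually intersected; the value alone cannot tell them apart, since a single list and an intersection result look the same.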