
Pyspark - Finding Intersection of key value pairs

Could you please help me to achieve the following:

Consider an i/p which is a list of key-value pairs, where the key is a (tuple) and the value is a [list]. If two identical keys exist in the i/p, their values should be intersected. If a key has no matching pair, it should be ignored in the o/p.

Example:

>>> data = [((1, 2), [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]), ((1, 2), [1, 3, 4, 5, 6, 7, 10, 11])]
>>> rdd = sc.parallelize(data)
>>> rdd.reduceByKey(lambda x,y : list(set(x).intersection(y))).collect()

o/p: [((1, 2), [3, 4, 5, 6, 7, 10, 11])]

Since there are two occurrences of the same key, the values were intersected.

>>> data = [((1, 2), [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]), ((1, 3), [1, 3, 4, 5, 6, 7, 10, 11])]
>>> rdd = sc.parallelize(data)
>>> rdd.reduceByKey(lambda x,y : list(set(x).intersection(y))).collect()

o/p that I get: [((1, 2), [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]), ((1, 3), [1, 3, 4, 5, 6, 7, 10, 11])]

Intended result: the o/p should be empty, since there is no matching key pair. (reduceByKey never calls the merge function for a key that occurs only once, so such keys pass through unchanged.)

I implemented the logic as below:

(datardd.map(lambda x: (x[0], (x[1], 1)))                                       # [(key, (value, counter))]
        .reduceByKey(lambda x, y: (set(x[0]).intersection(y[0]), x[1] + y[1]))  # intersect values, sum counters
        .filter(lambda x: x[1][1] > 1)                                          # keep keys seen more than once
        .map(lambda x: (x[0], list(x[1][0])))                                   # drop the counter, set -> list
        .collect())
  1. Map a counter variable for the given key along with the existing list value, in the form

    [(key, (value, counter))]:

    ex: [((1, 2), ([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 1))]

  2. Use reduceByKey to implement the intersection operation and increment the counter.

  3. Filter and publish the values for which the counter value is > 1.
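
For comparison, here is a minimal alternative sketch that computes the occurrence count in a separate RDD and joins it back; it assumes the same `sc` / `rdd` as in the examples above, and the names `counts` and `result` are just illustrative:

counts = rdd.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b)  # occurrences per key
result = (rdd.mapValues(set)                       # use sets so intersection is cheap
             .reduceByKey(lambda a, b: a & b)      # intersect the values of identical keys
             .join(counts)                         # attach the occurrence count to each key
             .filter(lambda kv: kv[1][1] > 1)      # drop keys that appeared only once
             .mapValues(lambda v: sorted(v[0]))    # discard the count, back to a (sorted) list
             .collect())

With the first data set above this gives [((1, 2), [3, 4, 5, 6, 7, 10, 11])]; with the second it gives [].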
