
Pyspark - Finding Intersection of key value pairs

Could you please help me to achieve the following:

Consider an i/p which is a list of key-value pairs, where the key is a (tuple) and the value is a [list]. If two identical keys exist in the i/p, their values should be intersected. If a key has no matching pair, it should be ignored in the o/p.

Example:

>>> data = [((1, 2), [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]), ((1, 2), [1, 3, 4, 5, 6, 7, 10, 11])]
>>> rdd = sc.parallelize(data)
>>> rdd.reduceByKey(lambda x,y : list(set(x).intersection(y))).collect()

o/p: [((1, 2), [3, 4, 5, 6, 7, 10, 11])]

Since there are two occurrences of the same key, the values were intersected.

>>> data = [((1, 2), [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]), ((1, 3), [1, 3, 4, 5, 6, 7, 10, 11])]
>>> rdd = sc.parallelize(data)
>>> rdd.reduceByKey(lambda x,y : list(set(x).intersection(y))).collect()

o/p that I get: [((1, 2), [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]), ((1, 3), [1, 3, 4, 5, 6, 7, 10, 11])]

Intended result: the o/p should be empty, since there is no matching key pair. (reduceByKey never calls the merge function for a key that occurs only once, so such keys pass through unchanged.)

I implemented the logic as below:

(datardd.map(lambda x: (x[0], (x[1], 1)))                                       # [(key, (value, counter))]
        .reduceByKey(lambda x, y: (set(x[0]).intersection(y[0]), x[1] + y[1]))  # intersect values, sum counters
        .filter(lambda x: x[1][1] > 1)                                          # keep keys seen more than once
        .map(lambda x: (x[0], list(x[1][0])))                                   # drop the counter, set -> list
        .collect())
  1. Map a counter variable for the given key along with the existing list value, in the form

    [(key, (value, counter))]:

    ex: [((1, 2), ([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 1))]

  2. Use reduceByKey to implement the intersection operation and increment the counter.

  3. Filter and publish the values for which the counter value is > 1.
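
For comparison, here is a minimal alternative sketch that computes the occurrence count in a separate RDD and joins it back; it assumes the same `sc` / `rdd` as in the examples above, and the names `counts` and `result` are just illustrative:

counts = rdd.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b)  # occurrences per key
result = (rdd.mapValues(set)                       # use sets so intersection is cheap
             .reduceByKey(lambda a, b: a & b)      # intersect the values of identical keys
             .join(counts)                         # attach the occurrence count to each key
             .filter(lambda kv: kv[1][1] > 1)      # drop keys that appeared only once
             .mapValues(lambda v: sorted(v[0]))    # discard the count, back to a (sorted) list
             .collect())

With the first data set above this gives [((1, 2), [3, 4, 5, 6, 7, 10, 11])]; with the second it gives [].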
