Join Pyspark Dataframes where two lists share a value

I have two dataframes of the form

df1 = 
+------+---------+
|group1|  members|
+------+---------+
|     1|[a, b, c]|
|     2|[d, e, f]|
|     3|[g, h, i]|
+------+---------+

df2 = 
+------+---------+
|group2|  members|
+------+---------+
|     4|[s, t, d]|
|     5|[u, v, w]|
|     6|[x, y, b]|
+------+---------+

I would like to join these dataframes on the condition that their members lists share a common value. For example, group2 would map onto df1 as:

+------+---------+------+
|group1|  members|group2|
+------+---------+------+
|     1|[a, b, c]|     6|
|     2|[d, e, f]|     4|
|     3|[g, h, i]|      |
+------+---------+------+

Is there an efficient method for this? At the moment I am just looping through the rows of df2 and using f.array_intersect() to compare.

You can use a left join whose condition uses the size function to check that the intersection of the two members arrays is greater than 0.

from pyspark.sql import functions as F

# Rename df2's columns to avoid a 'members' name clash, then left-join on overlapping arrays
df2 = df2.toDF('group2', 'members2')
df = df1.join(df2, F.size(F.array_intersect(df1.members, df2.members2)) > 0, 'left').drop('members2')
df.show(truncate=False)
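For reference, here is a minimal, self-contained sketch that rebuilds the example data with createDataFrame and applies the same join; the SparkSession setup and column names simply mirror the question, and the exact show() formatting may vary with the Spark version.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the example dataframes from the question
df1 = spark.createDataFrame(
    [(1, ['a', 'b', 'c']), (2, ['d', 'e', 'f']), (3, ['g', 'h', 'i'])],
    ['group1', 'members'])
df2 = spark.createDataFrame(
    [(4, ['s', 't', 'd']), (5, ['u', 'v', 'w']), (6, ['x', 'y', 'b'])],
    ['group2', 'members'])

# Rename df2's columns, then left-join where the arrays overlap
df2 = df2.toDF('group2', 'members2')
result = (df1.join(df2,
                   F.size(F.array_intersect(df1.members, df2.members2)) > 0,
                   'left')
             .drop('members2'))
result.orderBy('group1').show(truncate=False)
# Expected mapping (group1 -> group2): 1 -> 6, 2 -> 4, 3 -> null

Note that because the condition is not an equi-join, Spark will typically execute it as a broadcast nested-loop or cartesian-style join; if both dataframes are large, exploding the arrays and joining on the individual elements may scale better.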
