Pyspark DataFrame Filter column based on a column in another DataFrame without join
I have a pyspark dataframe called df1 that looks like this:
| ID1  | ID2 |
|------|-----|
| aaaa | a1  |
| bbbb | a2  |
| aaaa | a3  |
| bbbb | a4  |
| cccc | a2  |
And I have another dataframe called df2 that looks like this:
| ID2_1 | ID2_2 |
|-------|-------|
| a2    | a1    |
| a3    | a2    |
| a2    | a3    |
| a2    | a1    |
The values of ID2 in the first dataframe match the values in columns ID2_1 and ID2_2 in the second dataframe. So the resultant dataframe will look like this:
| ID1  | ID2 |
|------|-----|
| aaaa | a1  |
| bbbb | a2  |
| aaaa | a3  |
| cccc | a2  |
(the fourth line was filtered out)
I want to filter the column ID2 to contain only values that appear in one of the columns ID2_1 or ID2_2. I tried doing:

```python
filtered = df1.filter((f.col("ID2").isin(df2.ID2_1)) |
                      (f.col("ID2").isin(df2.ID2_2)))
```
But this doesn't seem to work. I have seen suggestions to use a join between the two columns, but that operation is too heavy and I'm trying to avoid it. Any suggestions on how to do this task?
Not sure why you would want to avoid a join, since the alternative may be just as computationally expensive. Anyway, code below:

```python
from pyspark.sql.functions import array, array_contains, array_distinct, col, lit

# g: df2's rows collected to the driver as a list of row-like sublists
# (g was left undefined in the original answer; df2.collect() fits how it is used)
g = df2.collect()

new = (df1.withColumn('x', array_distinct(
            array(*[lit(item) for sublist in g for item in sublist])))
          .where(array_contains(col('x'), col('ID2')))
          .drop('x'))
new.show(truncate=False)
```
```
+----+---+
|ID1 |ID2|
+----+---+
|aaaa|a1 |
|bbbb|a2 |
|aaaa|a3 |
|cccc|a2 |
+----+---+
```
Statement: The technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.