
PySpark DataFrame: filter a column based on a column in another DataFrame without a join

I have a PySpark DataFrame called df1 that looks like this:

ID1   ID2
aaaa  a1
bbbb  a2
aaaa  a3
bbbb  a4
cccc  a2

And I have another DataFrame called df2 that looks like this:

ID2_1  ID2_2
a2     a1
a3     a2
a2     a3
a2     a1

where the values of ID2 in the first DataFrame match values in the columns ID2_1 and ID2_2 of the second DataFrame.

So the resulting DataFrame should look like this:

ID1   ID2
aaaa  a1
bbbb  a2
aaaa  a3
cccc  a2

(the fourth row was filtered out)

I want to filter df1 so that ID2 contains only values that appear in either ID2_1 or ID2_2. I tried:

import pyspark.sql.functions as f

filter = df1.filter((f.col("ID2").isin(df2.ID2_1)) |
                    (f.col("ID2").isin(df2.ID2_2)))

But this doesn't seem to work. I have seen other suggestions to use a join between the two columns, but that operation is far too heavy and I'm trying to avoid it. Any suggestions on how to do this?

Not sure why you would want to avoid a join, because the alternative may well be just as computationally expensive.

Anyway:

  1. Create a list of the values in the df2 columns.
  2. Add the distinct elements of that list to df1 as an array column.
  3. Keep only the rows where ID2 appears in that array.

Code below:

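from pyspark.sql.functions import array, array_contains, array_distinct, col, lit

g = df2.collect()  # assumption: g is not defined in the original answer

new = (df1.withColumn('x', array_distinct(array(*[lit(x) for x in [item for sublist in g for item in sublist]])))
       .where(array_contains(col('x'), col('ID2')))
       .drop('x'))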

new.show(truncate=False)

+----+---+
|ID1 |ID2|
+----+---+
|aaaa|a1 |
|bbbb|a2 |
|aaaa|a3 |
|cccc|a2 |
+----+---+

