Pyspark DataFrame Filter column based on a column in another DataFrame without join
I have a pyspark dataframe called df1 that looks like this:
| ID1  | ID2 |
|------|-----|
| aaaa | a1  |
| bbbb | a2  |
| aaaa | a3  |
| bbbb | a4  |
| cccc | a2  |
And I have another dataframe called df2 that looks like this:
| ID2_1 | ID2_2 |
|-------|-------|
| a2    | a1    |
| a3    | a2    |
| a2    | a3    |
| a2    | a1    |
The values of ID2 in the first dataframe match the values in columns ID2_1 and ID2_2 in the second dataframe. So the resultant dataframe will look like this:
| ID1  | ID2 |
|------|-----|
| aaaa | a1  |
| bbbb | a2  |
| aaaa | a3  |
| cccc | a2  |
(the fourth line was filtered out)
I want to filter the column ID2 to contain only values that appear in one of the columns ID2_1 or ID2_2. I tried doing:

```python
filtered = df1.filter((f.col("ID2").isin(df2.ID2_1)) |
                      (f.col("ID2").isin(df2.ID2_2)))
```
But this doesn't seem to work. I have seen suggestions to use a join between the two columns, but that operation is too heavy and I'm trying to avoid it. Any suggestions on how to do this task?
Not sure why you would want to avoid a join, since the alternative may be just as computationally expensive. Anyway, code below:

```python
from pyspark.sql.functions import array, array_contains, array_distinct, col, lit

# g: df2's rows collected to the driver as a list of row-like sublists
# (g was left undefined in the original answer; df2.collect() fits how it is used)
g = df2.collect()

new = (df1.withColumn('x', array_distinct(
            array(*[lit(item) for sublist in g for item in sublist])))
          .where(array_contains(col('x'), col('ID2')))
          .drop('x'))
new.show(truncate=False)
```
```
+----+---+
|ID1 |ID2|
+----+---+
|aaaa|a1 |
|bbbb|a2 |
|aaaa|a3 |
|cccc|a2 |
+----+---+
```
Statement: The technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.