I have a pyspark dataframe called df1
that looks like this:
ID1 | ID2 |
---|---|
aaaa | a1 |
bbbb | a2 |
aaaa | a3 |
bbbb | a4 |
cccc | a2 |
And I have another dataframe called df2
that looks like this:
ID2_1 | ID2_2 |
---|---|
a2 | a1 |
a3 | a2 |
a2 | a3 |
a2 | a1 |
where the values of ID2 in the first dataframe match the values in columns ID2_1 and ID2_2 of the second dataframe.
So the resultant dataframe will look like this:
ID1 | ID2 |
---|---|
aaaa | a1 |
bbbb | a2 |
aaaa | a3 |
cccc | a2 |
(the fourth row, bbbb | a4, was filtered out, since a4 does not appear in df2)
I want to filter the column ID2 to contain only values that appear in at least one of the columns ID2_1 or ID2_2. I tried:

```python
filtered = df1.filter(
    (f.col("ID2").isin(df2.ID2_1)) |
    (f.col("ID2").isin(df2.ID2_2))
)
```
But this doesn't seem to work. I have seen suggestions to use a join between the two dataframes, but that operation is too heavy and I'm trying to avoid it. Any suggestions on how to do this?
Not sure why you would want to avoid a join, since the alternatives can be just as computationally expensive. Anyway, the code is below:
```python
from pyspark.sql.functions import array, array_contains, array_distinct, col, lit

# Collect the ID2_1/ID2_2 values of df2 to the driver as a list of rows,
# then flatten them into a single list of IDs.
g = df2.select('ID2_1', 'ID2_2').collect()
ids = [item for sublist in g for item in sublist]

# Build an array column of the allowed IDs, keep only the df1 rows
# whose ID2 appears in it, then drop the helper column.
new = (df1.withColumn('x', array_distinct(array(*[lit(x) for x in ids])))
          .where(array_contains(col('x'), col('ID2')))
          .drop('x'))
new.show(truncate=False)
```
+----+---+
|ID1 |ID2|
+----+---+
|aaaa|a1 |
|bbbb|a2 |
|aaaa|a3 |
|cccc|a2 |
+----+---+