Are there any alternatives to a full outer join for comparing PySpark dataframes with no key columns?

So I've been looking at different ways to compare two PySpark dataframes where we have no key columns.

Let's say I have two dataframes, df1 & df2, with columns col1, col2, col3.

The idea is that I would get an output dataframe containing rows from df1 that do not match with any rows in df2 and vice versa. I would also like some kind of flag so I can distinguish between rows from df1 and rows from df2.

So far I have looked at a full outer join as a method, for example:

from pyspark.sql.functions import col, lit, when

columns = df1.columns
# Tag each side so we can tell where a row came from after the join
df1 = df1.withColumn("df1_flag", lit("X"))
df2 = df2.withColumn("df2_flag", lit("X"))
df3 = (df1.join(df2, columns, how="full")
    .withColumn("FLAG",
        when(col("df1_flag").isNotNull() & col("df2_flag").isNotNull(), "MATCHED")
        .otherwise(when(col("df1_flag").isNotNull(), "df1").otherwise("df2")))
    .drop("df1_flag", "df2_flag"))
df4 = df3.filter(col("FLAG") != "MATCHED")

The issue with the full outer join is that I may need to deal with some very large dataframes (over a million records), so I am concerned about efficiency. I have also thought about running a left anti join in each direction and combining the results (see the sketch below), but I have efficiency worries about that too.
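For reference, a minimal sketch of that anti-join variant, assuming both dataframes share the same column names and types; the FLAG column name is illustrative:

from pyspark.sql.functions import lit

columns = df1.columns

# Rows in df1 with no exact match in df2, and vice versa,
# using a left anti join on all columns in each direction
only_in_df1 = df1.join(df2, columns, how="left_anti").withColumn("FLAG", lit("df1"))
only_in_df2 = df2.join(df1, columns, how="left_anti").withColumn("FLAG", lit("df2"))

result = only_in_df1.unionByName(only_in_df2)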

Is there any method of comparison I am overlooking here that could be more efficient for very large dataframes?

You can run a minus (EXCEPT ALL) query on your dataframes:

# exceptAll keeps duplicate rows, like SQL's EXCEPT ALL
mismatched_df1 = df1.exceptAll(df2)  # rows in df1 not present in df2
mismatched_df2 = df2.exceptAll(df1)  # rows in df2 not present in df1
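If you still need the provenance flag from the question, you can tag each result and union them; a minimal sketch (the FLAG column name is illustrative):

from pyspark.sql.functions import lit

# Tag each side, then stack the two mismatch sets into one dataframe
flagged = (mismatched_df1.withColumn("FLAG", lit("df1"))
           .unionByName(mismatched_df2.withColumn("FLAG", lit("df2"))))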
