
Are there any alternatives to a full outer join for comparing PySpark dataframes with no key columns?

So I've been looking at different ways to compare two PySpark dataframes where we have no key columns.

Let's say I have two dataframes, df1 & df2, with columns col1, col2, col3.

The idea is that I would get an output dataframe containing the rows from df1 that do not match any row in df2, and vice versa. I would also like some kind of flag so I can distinguish rows that came from df1 from rows that came from df2.
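To make this concrete, here is a small made-up example (the sample rows are purely illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up sample data: (1, "a", "x") appears in both dataframes,
# (2, "b", "y") only in df1, and (3, "c", "z") only in df2
df1 = spark.createDataFrame([(1, "a", "x"), (2, "b", "y")], ["col1", "col2", "col3"])
df2 = spark.createDataFrame([(1, "a", "x"), (3, "c", "z")], ["col1", "col2", "col3"])

# Desired output: (2, "b", "y") flagged as coming from df1,
# and (3, "c", "z") flagged as coming from df2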

I have so far looked at a full outer join as a method, such as:

from pyspark.sql.functions import col, lit, when

columns = df1.columns
# Tag each side so that after the join we can tell which dataframe a row came from
df1 = df1.withColumn("df1_flag", lit("X"))
df2 = df2.withColumn("df2_flag", lit("X"))
# Full outer join on every column; rows present on both sides are "MATCHED"
df3 = df1.join(df2, columns, how="full") \
    .withColumn("flag", when(col("df1_flag").isNotNull() & col("df2_flag").isNotNull(), "MATCHED")
                .otherwise(when(col("df1_flag").isNotNull(), "df1").otherwise("df2"))) \
    .drop("df1_flag", "df2_flag")
# Keep only the mismatched rows
df4 = df3.filter(df3.flag != "MATCHED")

The issue with the full outer join is that I may need to deal with some very large dataframes (1 million+ records), so I am concerned about efficiency. I have also thought about using a left anti join and a right anti join and then combining the results (sketched below), but there are efficiency worries with that approach as well.
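For reference, that anti-join idea would look roughly like this (a sketch only; the flag column name is just for illustration):

from pyspark.sql.functions import lit

# A left anti join keeps the rows of the left side that have no exact match
# on the right (a "right anti" is just a left anti with the sides swapped)
only_in_df1 = df1.join(df2, on=df1.columns, how="left_anti").withColumn("flag", lit("df1"))
only_in_df2 = df2.join(df1, on=df2.columns, how="left_anti").withColumn("flag", lit("df2"))
mismatches = only_in_df1.unionByName(only_in_df2)

# Caveat: join keys containing NULL never match, so rows with NULLs
# always end up in the mismatch output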

Is there any method of comparison I am overlooking here that could be more efficient for very large dataframes?

You can run a minus query on your dataframes:

mismatched_df1 = df1.exceptAll(df2)  # rows in df1 that are not in df2
mismatched_df2 = df2.exceptAll(df1)  # rows in df2 that are not in df1
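If you also need the flag from the question to tell which dataframe each mismatched row came from, one way (a sketch; the flag column name is just illustrative) is to tag each result and union them:

from pyspark.sql.functions import lit

# Tag each set of mismatches with its origin and stack them into one dataframe
flagged = mismatched_df1.withColumn("flag", lit("df1")) \
    .unionByName(mismatched_df2.withColumn("flag", lit("df2")))

Note that exceptAll preserves duplicates (multiset semantics) and, unlike a join, treats NULLs as equal when comparing rows, so it avoids the NULL-matching caveat of the join-based approaches.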
