Spark: subtract two DataFrames

Question

In Spark version 1.2.0 one could use subtract with two SchemaRDDs to keep only the rows of the first one that are not in the second:

val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)

onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD.

How can this be achieved with DataFrames in Spark version 1.3.0?

Answer 1

According to the Scala API docs, doing:

dataFrame1.except(dataFrame2)

will return a new DataFrame containing the rows in dataFrame1 but not in dataFrame2.

Answer 2

In PySpark it would be subtract:

df1.subtract(df2)

or exceptAll if duplicates need to be preserved:

df1.exceptAll(df2)

Answer 3

I tried subtract, but the result was not what I expected. When I run df1.subtract(df2), not all rows of df1 appear in the result DataFrame, probably because of the distinct behaviour mentioned in the docs.

exceptAll solved my problem: df1.exceptAll(df2)
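To make the difference concrete, here is a minimal plain-Python sketch of the two behaviours described above. It is only a model of the semantics, not Spark code: the helper names except_distinct and except_all and the sample rows are made up for illustration, and real Spark results come back in no guaranteed order.

```python
from collections import Counter

def except_distinct(left, right):
    """Model of df1.subtract(df2): distinct rows of left that are absent from right."""
    right_rows = set(right)
    seen = set()
    result = []
    for row in left:
        # Drop rows present in right, and drop repeated copies of kept rows.
        if row not in right_rows and row not in seen:
            seen.add(row)
            result.append(row)
    return result

def except_all(left, right):
    """Model of df1.exceptAll(df2): multiset difference, duplicates preserved."""
    remaining = Counter(right)
    result = []
    for row in left:
        if remaining[row] > 0:
            remaining[row] -= 1  # one copy in right cancels one copy in left
        else:
            result.append(row)
    return result

# Hypothetical sample rows: ("a", 1) and ("c", 3) are duplicated in df1.
df1_rows = [("a", 1), ("a", 1), ("b", 2), ("c", 3), ("c", 3)]
df2_rows = [("c", 3)]

print(except_distinct(df1_rows, df2_rows))  # [('a', 1), ('b', 2)]
print(except_all(df1_rows, df2_rows))       # [('a', 1), ('a', 1), ('b', 2), ('c', 3)]
```

In this model subtract collapses the two ("a", 1) rows into one and removes every ("c", 3), while exceptAll keeps both ("a", 1) rows and one leftover ("c", 3). That matches the "missing rows" effect reported with subtract.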

Answer 4

You can also use join with the 'left_anti' option. Note that, as far as I can tell, this join type is only available from Spark 2.0, not 1.3.0:

df1.join(df2, on='key_column', how='left_anti')

These are PySpark APIs, but I guess there is a corresponding function in Scala too.
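A left anti join is not quite the same operation as subtract: it compares only the join key, and it keeps duplicate rows from the left side. The plain-Python sketch below models that behaviour. It is not Spark code, the helper left_anti_join and the sample rows are hypothetical, and it ignores null keys.

```python
def left_anti_join(left, right, key):
    """Model of df1.join(df2, on=key, how='left_anti'):
    keep the rows of left whose key value does not appear in right."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]

# Hypothetical sample rows, each represented as a dict.
df1_rows = [
    {"key_column": 1, "value": "x"},
    {"key_column": 1, "value": "x"},   # duplicate of the previous row
    {"key_column": 2, "value": "y"},
]
df2_rows = [
    {"key_column": 2, "value": "something else"},
]

for row in left_anti_join(df1_rows, df2_rows, "key_column"):
    print(row)
# {'key_column': 1, 'value': 'x'}
# {'key_column': 1, 'value': 'x'}
```

The row with key_column 2 is dropped even though its value column differs between the two inputs, and both copies of the key_column 1 row survive. If whole rows should be compared, subtract or exceptAll is the closer match.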

Answer 5

For me, df1.subtract(df2) was inconsistent: it worked correctly on one DataFrame but not on another. The cause was duplicates. df1.exceptAll(df2) returns a new DataFrame with the records from df1 that do not exist in df2, including any duplicates.

Answer 6

Since Spark 2.4.0, use exceptAll:

data_cl = reg_data.exceptAll(data_fr)

6 answers

Answer 1: 91 (accepted), 2015-04-10 09:12:12
Answer 2: 57, 2016-06-15 14:01:36
Answer 3: 10, 2018-11-29 01:38:53
Answer 4: 3, 2021-05-26 07:51:11
Answer 5: 2, 2020-10-26 02:24:50
Answer 6: 1, 2021-01-05 10:45:51
