
Spark: subtract two DataFrames

In Spark version 1.2.0 one could use subtract with two SchemaRDDs to end up with only the content that is unique to the first one:

val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)

onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD.

How can this be achieved with DataFrames in Spark version 1.3.0?

According to the Scala API docs, doing:

dataFrame1.except(dataFrame2)

will return a new DataFrame containing rows in dataFrame1 but not in dataFrame2.

In PySpark it would be subtract

df1.subtract(df2)

or exceptAll if duplicates need to be preserved

df1.exceptAll(df2)
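
For illustration, a minimal PySpark sketch of subtract, assuming a local SparkSession and made-up data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "val"])
df2 = spark.createDataFrame([(2, "b")], ["id", "val"])

# Rows of df1 that do not appear in df2; the result is deduplicated
df1.subtract(df2).show()  # leaves (1, "a") and (3, "c")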

I tried subtract, but the result was not consistent. If I run df1.subtract(df2), not all rows of df1 show up in the result DataFrame, probably because of the distinct mentioned in the docs.

exceptAll solved my problem: df1.exceptAll(df2)
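
A toy reproduction of that inconsistency (hypothetical data; exceptAll requires Spark 2.4+):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# df1 holds a duplicated row that has no match in df2
df1 = spark.createDataFrame([(1,), (1,), (2,)], ["id"])
df2 = spark.createDataFrame([(2,)], ["id"])

df1.subtract(df2).count()   # 1: the two (1,) rows collapse into one
df1.exceptAll(df2).count()  # 2: both (1,) rows survive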

From Spark 2.0, you can also use a join with the 'left_anti' option:

df1.join(df2, on='key_column', how='left_anti')

These are PySpark APIs, but I guess there is a corresponding function in Scala too.
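
A short sketch of the anti-join approach with a hypothetical key_column; note that, unlike exceptAll, it compares only the join key rather than entire rows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["key_column", "val"])
df2 = spark.createDataFrame([(2, "x")], ["key_column", "other"])

# Keep the rows of df1 whose key_column has no match in df2
df1.join(df2, on="key_column", how="left_anti").show()  # leaves (1, "a")

In Scala the equivalent is df1.join(df2, Seq("key_column"), "left_anti").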

For me, df1.subtract(df2) was inconsistent: it worked correctly on one DataFrame but not on another, because of duplicates. df1.exceptAll(df2) returns a new DataFrame with the records from df1 that do not exist in df2, keeping any duplicates.

From Spark 2.4.0, use exceptAll:

data_cl = reg_data.exceptAll(data_fr)
