Spark: subtract two DataFrames

In Spark version 1.2.0 one could use subtract with two SchemaRDDs to end up with only the content of the first one that differs from the second:

val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)

onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD.

How can this be achieved with DataFrames in Spark version 1.3.0?

According to the Scala API docs, doing:

dataFrame1.except(dataFrame2)

will return a new DataFrame containing the rows that are in dataFrame1 but not in dataFrame2.

In PySpark it would be subtract:

df1.subtract(df2)

or exceptAll if duplicates need to be preserved:

df1.exceptAll(df2)
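
To make the difference between the two calls concrete, here is a minimal plain-Python sketch of the semantics only. It is not Spark code: the helper names `subtract_model` and `except_all_model` and the sample rows are hypothetical, and rows are modelled as tuples. It assumes subtract behaves like SQL EXCEPT DISTINCT and exceptAll like SQL EXCEPT ALL.

```python
from collections import Counter


def subtract_model(left, right):
    """Model of EXCEPT DISTINCT: distinct rows of `left` that never appear in `right`."""
    right_rows = set(right)
    seen = set()
    result = []
    for row in left:
        # Drop rows present in `right`, and collapse repeats within `left`.
        if row not in right_rows and row not in seen:
            seen.add(row)
            result.append(row)
    return result


def except_all_model(left, right):
    """Model of EXCEPT ALL: each row survives max(count_left - count_right, 0) times."""
    # Counter subtraction keeps only strictly positive counts.
    remaining = Counter(left) - Counter(right)
    return list(remaining.elements())


# Hypothetical sample data; ("a", 1) and ("c", 3) are duplicated on purpose.
df1_rows = [("a", 1), ("a", 1), ("b", 2), ("c", 3), ("c", 3)]
df2_rows = [("c", 3)]

print(subtract_model(df1_rows, df2_rows))    # duplicates of ("a", 1) collapse, ("c", 3) disappears
print(except_all_model(df1_rows, df2_rows))  # both ("a", 1) rows kept, one ("c", 3) left over
```

In this model the distinct variant returns fewer rows than df1 holds even for rows that have no counterpart in df2, which matches the "missing rows" effect described in the answers below. Row order in a real Spark result is not guaranteed, so compare results as sets or sorted lists.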

I tried subtract, but the result was not consistent. If I run df1.subtract(df2), not all rows of df1 are shown in the resulting DataFrame, probably because of the distinct mentioned in the docs.

exceptAll solved my problem: df1.exceptAll(df2)

From Spark 1.3.0, you can use join with the 'left_anti' option:

df1.join(df2, on='key_column', how='left_anti')

These are PySpark APIs, but I guess there is a corresponding function in Scala too.
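
A left anti join differs from subtract in that it compares only the join key rather than whole rows, and it does not deduplicate the left side. The following is a rough plain-Python model of that behaviour, not Spark code: `left_anti_join_model` and the sample rows are hypothetical, and rows are modelled as dicts.

```python
def left_anti_join_model(left, right, key):
    """Keep every row of `left` whose `key` value has no match in `right`."""
    right_keys = {row[key] for row in right if row[key] is not None}
    # A None (null) key never matches in an equi-join, so such left rows are kept.
    return [row for row in left if row[key] is None or row[key] not in right_keys]


# Hypothetical sample data: key 1 is duplicated, key 2 exists in both inputs
# but with a different value.
df1_rows = [
    {"key_column": 1, "value": "x"},
    {"key_column": 1, "value": "x"},
    {"key_column": 2, "value": "y"},
    {"key_column": 3, "value": "z"},
]
df2_rows = [{"key_column": 2, "value": "different"}]

for row in left_anti_join_model(df1_rows, df2_rows, "key_column"):
    print(row)
```

Here the row with key 2 is dropped even though its value differs between the two inputs, and both rows with key 1 survive. Whether that is the wanted behaviour depends on whether "same row" means "same key" or "same in every column".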

For me, df1.subtract(df2) was inconsistent: it worked correctly on one DataFrame but not on another. That was because of duplicates. df1.exceptAll(df2) returns a new DataFrame with the records from df1 that do not exist in df2, including any duplicates.

From Spark 2.4.0, use exceptAll:

data_cl = reg_data.exceptAll(data_fr)
