
How to subtract two DataFrames keeping duplicates in Spark 2.3.0

Spark 2.4.0 introduces the handy new function exceptAll, which subtracts two DataFrames while keeping duplicates.

Example

  val df1 = Seq(
    ("a", 1L),
    ("a", 1L),
    ("a", 1L),
    ("b", 2L)
  ).toDF("id", "value")
  val df2 = Seq(
    ("a", 1L),
    ("b", 2L)
  ).toDF("id", "value")

df1.exceptAll(df2).collect()
// returns rows equivalent to Seq(("a", 1L), ("a", 1L))

However, I can only use Spark 2.3.0.

What is the best way to implement this using only functions from Spark 2.3.0?

One option is to use row_number to number the duplicate occurrences of each row, then left-join on all columns plus that number; the rows of df1 that find no match in df2 are exactly the rows exceptAll would keep.

A PySpark solution is shown below.

from pyspark.sql import Window
from pyspark.sql.functions import row_number

# Partition by *all* columns so that the n-th occurrence of a row in df1
# lines up with the n-th occurrence of the same row in df2.
w1 = Window.partitionBy(df1.id, df1.value).orderBy(df1.value)
w2 = Window.partitionBy(df2.id, df2.value).orderBy(df2.value)
df1 = df1.withColumn("rnum", row_number().over(w1))
df2 = df2.withColumn("rnum", row_number().over(w2))

# Left join on all columns plus the occurrence number; rows of df1
# with no match in df2 are exactly the rows exceptAll would keep.
res_like_exceptAll = (
    df1.join(df2, (df1.id == df2.id) & (df1.value == df2.value) & (df1.rnum == df2.rnum), "left")
       .filter(df2.id.isNull())  # identifies the missing rows
       .select(df1.id, df1.value)
)
res_like_exceptAll.show()
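
Since the question's own example is in Scala, a minimal Scala sketch of the same idea follows. It swaps the left join plus null filter for a left_anti join, which keeps only the unmatched rows directly; the names w, left, right, and resLikeExceptAll are illustrative, and the sketch assumes nothing beyond DataFrame APIs available in Spark 2.3.0 (row_number, Window, and left_anti joins).

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Number the duplicate occurrences of each (id, value) row.
val w = Window.partitionBy("id", "value").orderBy("value")
val left  = df1.withColumn("rnum", row_number().over(w))
val right = df2.withColumn("rnum", row_number().over(w))

// left_anti keeps only the rows of `left` with no match in `right`,
// i.e. the multiset difference that exceptAll computes.
val resLikeExceptAll = left
  .join(right, Seq("id", "value", "rnum"), "left_anti")
  .drop("rnum")

resLikeExceptAll.show()
// expected: the two remaining ("a", 1) rows, as with df1.exceptAll(df2)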
