
How to subtract two DataFrames keeping duplicates in Spark 2.3.0

Spark 2.4.0 introduces a handy new function, exceptAll, which subtracts one DataFrame from another while keeping duplicates.


  val df1 = Seq(
    ("a", 1L),
    ("a", 1L),
    ("a", 1L),
    ("b", 2L)
  ).toDF("id", "value")
  val df2 = Seq(
    ("a", 1L),
    ("b", 2L)
  ).toDF("id", "value")

// df1.exceptAll(df2) will return

Seq(("a", 1L), ("a", 1L))
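
To make the expected semantics concrete, here is a minimal pure-Python sketch that treats the two DataFrames as plain lists of tuples and computes the multiset difference. It does not use Spark at all; the names `except_all`, `df1_rows` and `df2_rows` are made up for this illustration.

```python
from collections import Counter


def except_all(left, right):
    """Multiset difference: each row in `right` cancels at most one equal row in `left`."""
    # Counter subtraction drops entries whose count falls to zero or below.
    remaining = Counter(left) - Counter(right)
    return sorted(remaining.elements())


# Same data as the Scala example above, as plain tuples (hypothetical names).
df1_rows = [("a", 1), ("a", 1), ("a", 1), ("b", 2)]
df2_rows = [("a", 1), ("b", 2)]

result = except_all(df1_rows, df2_rows)
print(result)  # [('a', 1), ('a', 1)]
```

One ("a", 1) row and the single ("b", 2) row are cancelled, so two copies of ("a", 1) remain. A plain set difference would instead return nothing here, which is why the duplicates need special handling.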

However, I can only use Spark 2.3.0.

What is the best way to implement this using only functions available in Spark 2.3.0?

One option is to use row_number to number the duplicate rows within each group, then include that number in the condition of a left join. Rows of df1 that find no match in df2 are the ones exceptAll would return.

A PySpark solution is shown here.

 from pyspark.sql.functions import row_number
 from pyspark.sql import Window
 # Partition by every compared column so the n-th copy of a row gets rnum = n in both frames
 w1 = Window.partitionBy(df1.id, df1.value).orderBy(df1.value)
 w2 = Window.partitionBy(df2.id, df2.value).orderBy(df2.value)
 df1 = df1.withColumn("rnum", row_number().over(w1))
 df2 = df2.withColumn("rnum", row_number().over(w2))
 # Left join on all columns plus rnum; a null df2.rnum marks a df1 row with no match in df2
 res_like_exceptAll = df1.join(df2, (df1.id == df2.id) & (df1.value == df2.value) & (df1.rnum == df2.rnum), 'left') \
                         .filter(df2.rnum.isNull()) \
                         .select(df1.id, df1.value)
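
The row-number idea can also be checked without a Spark session. The sketch below is a pure-Python stand-in, not the Spark API: it numbers each occurrence of a row, then keeps the left-hand rows whose (row, number) key has no match on the right. The helper names `with_row_number` and `except_all_by_row_number` and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def with_row_number(rows):
    """Tag each row with its 1-based occurrence number among equal rows."""
    seen = defaultdict(int)
    numbered = []
    for row in rows:
        seen[row] += 1
        numbered.append((row, seen[row]))
    return numbered


def except_all_by_row_number(left, right):
    """Keep left rows whose (row, rnum) key is absent on the right (a left anti join)."""
    right_keys = set(with_row_number(right))
    return [row for row, rnum in with_row_number(left) if (row, rnum) not in right_keys]


# Hypothetical sample data; ("a", 2) shares an id with ("a", 1) but is a different row.
left = [("a", 1), ("a", 1), ("a", 1), ("b", 2), ("a", 2)]
right = [("a", 1), ("b", 2), ("a", 2)]

missing = except_all_by_row_number(left, right)
print(missing)  # [('a', 1), ('a', 1)]
```

Numbering per complete row, rather than per id alone, is what keeps the counters aligned on both sides: the first copy of ("a", 1) is number 1 in both lists regardless of which other rows share the id "a".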

