
Compare two Spark dataframes

Spark dataframe 1:

+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193    |
|city 1|prod 2 |8/25/2017|50  |687|201    |
|city 1|prod 3 |9/9/2017 |236 |431|169    |
|city 2|prod 1 |9/28/2017|358 |975|193    |
|city 2|prod 2 |8/24/2017|50  |687|201    |
|city 3|prod 3 |9/8/2017 |236 |431|169    |
+------+-------+---------+----+---+-------+

Spark dataframe 2:

+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193    |
|city 1|prod 2 |8/25/2017|50  |687|201    |
|city 1|prod 3 |9/9/2017 |230 |430|160    |
|city 1|prod 4 |9/27/2017|350 |90 |190    |
|city 2|prod 2 |8/24/2017|50  |687|201    |
|city 3|prod 3 |9/8/2017 |236 |431|169    |
|city 3|prod 4 |9/18/2017|230 |431|169    |
+------+-------+---------+----+---+-------+

Please find the resulting Spark dataframes for the following conditions applied to dataframe 1 and dataframe 2 above:

  1. Deleted records
  2. New records
  3. Records with no changes
  4. Records with changes

    Here the comparison keys are 'city', 'product', 'date'.

We need a solution without using Spark SQL.

I am not sure about finding the deleted and modified records, but you can use the except function to get the difference:

df2.except(df1)

This returns the rows that have been added or modified in dataframe 2, i.e. the records with changes. Output:

+------+-------+---------+----+---+-------+
|  city|product|     date|sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 3| prod 4|9/18/2017| 230|431|    169|
|city 1| prod 4|9/27/2017| 350| 90|    190|
|city 1| prod 3|9/9/2017 | 230|430|    160|
+------+-------+---------+----+---+-------+

You can also try a join and filter to get the changed and unchanged data:

df1.join(df2, Seq("city","product", "date"), "left").show(false)
df1.join(df2, Seq("city","product", "date"), "right").show(false)
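
As a rough sketch of that join-and-filter idea (not this answer's exact code), one could tag each side, do a full outer join on the keys, and then filter; the names keys, valueCols, in_1 and in_2 below are illustrative:

import org.apache.spark.sql.functions.{col, lit}

// comparison keys from the question; value columns are everything else
val keys = Seq("city", "product", "date")
val valueCols = df1.columns.filterNot(keys.contains)   // sale, exp, wastage

// tag each side; rename df2's value columns so both versions survive the join
val left = df1.withColumn("in_1", lit(true))
val df2Renamed = valueCols.foldLeft(df2)((df, c) => df.withColumnRenamed(c, c + "_2"))
val right = df2Renamed.withColumn("in_2", lit(true))

val joined = left.join(right, keys, "full_outer")

val deleted   = joined.filter(col("in_1").isNotNull && col("in_2").isNull)     // only in df1
val newRecs   = joined.filter(col("in_1").isNull && col("in_2").isNotNull)     // only in df2
val matched   = joined.filter(col("in_1").isNotNull && col("in_2").isNotNull)  // key in both
val unchanged = matched.filter(valueCols.map(c => col(c) === col(c + "_2")).reduce(_ && _))
val changed   = matched.filter(valueCols.map(c => col(c) =!= col(c + "_2")).reduce(_ || _))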

Hope this helps!

A scalable and easy way is to diff the two DataFrames with spark-extension:

import uk.co.gresearch.spark.diff._

df1.diff(df2, "city", "product", "date").show

+----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
|diff|  city|product|      date|left_sale|right_sale|left_exp|right_exp|left_wastage|right_wastage|
+----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
|   N|city 1|prod 2 |2017-08-25|       50|        50|     687|      687|         201|          201|
|   C|city 1|prod 3 |2017-09-09|      236|       230|     431|      430|         169|          160|
|   I|city 3|prod 4 |2017-09-18|     null|       230|    null|      431|        null|          169|
|   N|city 3|prod 3 |2017-09-08|      236|       236|     431|      431|         169|          169|
|   D|city 2|prod 1 |2017-09-28|      358|      null|     975|     null|         193|         null|
|   I|city 1|prod 4 |2017-09-27|     null|       350|    null|       90|        null|          190|
|   N|city 1|prod 1 |2017-09-29|      358|       358|     975|      975|         193|          193|
|   N|city 2|prod 2 |2017-08-24|       50|        50|     687|      687|         201|          201|
+----+------+-------+----------+---------+----------+--------+---------+------------+-------------+

It identifies Inserted (I), Changed (C), Deleted (D) and uNchanged (N) rows.
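
For example, the four dataframes asked for in the question could then be obtained by filtering on that diff column (a small sketch based on the output above):

import org.apache.spark.sql.functions.col

val diffDF = df1.diff(df2, "city", "product", "date")

val deleted   = diffDF.filter(col("diff") === "D")  // in df1 but not in df2
val newRecs   = diffDF.filter(col("diff") === "I")  // in df2 but not in df1
val changed   = diffDF.filter(col("diff") === "C")  // same key, different values
val unchanged = diffDF.filter(col("diff") === "N")  // identical rows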

Check out MegaSparkDiff, an open source project on GitHub that helps compare dataframes. The project is not yet published in Maven Central, but you can look at the SparkCompare Scala class, which compares two dataframes.

The code snippet below will give you two dataframes: one with the rows inLeftButNotInRight and another with the rows inRightButNotInLeft.

If you do a JOIN between both, you can apply some logic to identify the missing primary keys (where possible), and those keys would constitute the deleted records.
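
A rough sketch of that last step, assuming the helper below has produced inLnotinR and inRnotinL and that the comparison keys are city, product and date:

// sketch only: df1/df2 are the original inputs; inLnotinR/inRnotinL are the
// two dataframes returned by compareSchemaDataFrames below
val keys = Seq("city", "product", "date")

// rows of df1 whose key no longer appears anywhere in df2 -> deleted records
val deleted = inLnotinR.join(df2, keys, "left_anti")

// rows of df2 whose key does not appear in df1 -> new records
val added = inRnotinL.join(df1, keys, "left_anti")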

We are working on adding the use case that you are looking for to the project: https://github.com/FINRAOS/MegaSparkDiff

https://github.com/FINRAOS/MegaSparkDiff/blob/master/src/main/scala/org/finra/msd/sparkcompare/SparkCompare.scala

// assuming Pair / ImmutablePair from Apache Commons Lang3
import org.apache.commons.lang3.tuple.{ImmutablePair, Pair}
import org.apache.spark.sql.DataFrame

private def compareSchemaDataFrames(left: DataFrame, leftViewName: String,
                                    right: DataFrame, rightViewName: String): Pair[DataFrame, DataFrame] = {
  // make sure that column names match in both dataFrames
  if (!left.columns.sameElements(right.columns)) {
    println("column names were different")
    throw new Exception("Column Names Did Not Match")
  }

  val leftCols = left.columns.mkString(",")
  val rightCols = right.columns.mkString(",")

  // group by all columns in both data frames so duplicate rows are counted rather than lost
  val groupedLeft = left.sqlContext.sql("select " + leftCols + " , count(*) as recordRepeatCount from " + leftViewName + " group by " + leftCols)
  val groupedRight = left.sqlContext.sql("select " + rightCols + " , count(*) as recordRepeatCount from " + rightViewName + " group by " + rightCols)

  // do the except/subtract command
  val inLnotinR = groupedLeft.except(groupedRight).toDF()
  val inRnotinL = groupedRight.except(groupedLeft).toDF()

  new ImmutablePair[DataFrame, DataFrame](inLnotinR, inRnotinL)
}

See below the utility function I used to compare two dataframes using the following criteria:

  1. Column length
  2. Record count
  3. Column-by-column comparison of all records

Task three is done by using a hash of the concatenation of all columns in a record.

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, concat_ws, md5}

def verifyMatchAndSaveSignatureDifferences(oldDF: DataFrame, newDF: DataFrame, pkColumn: String): Long = {
  assert(oldDF.columns.length == newDF.columns.length, "column lengths don't match")
  assert(oldDF.count == newDF.count, "record counts don't match")

  // hash the concatenation of all columns of a record into a single signature column
  def createHashColumn(df: DataFrame): Column = {
    val colArr = df.columns
    md5(concat_ws("", colArr.map(col(_)): _*))
  }

  val newSigDF = newDF.select(col(pkColumn), createHashColumn(newDF).as("signature_new"))
  val oldSigDF = oldDF.select(col(pkColumn), createHashColumn(oldDF).as("signature"))

  // join on the primary key and keep only the rows whose signatures differ
  val joinDF = newSigDF.join(oldSigDF, newSigDF(pkColumn) === oldSigDF(pkColumn))
    .where(col("signature") =!= col("signature_new"))
    .drop(oldSigDF(pkColumn))
    .cache

  val diff = joinDF.count
  // write out any records that don't match
  if (diff > 0)
    joinDF.write.saveAsTable("signature_table")

  joinDF.unpersist()

  diff
}

If the method returns 0, then both dataframes are exactly the same; otherwise, a table named signature_table in the default Hive schema will contain all the records that differ between the two.
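
A minimal usage sketch, with hypothetical dataframe names oldDF and newDF sharing a key column called "id":

// hypothetical call: oldDF and newDF are two snapshots keyed by "id"
val mismatches = verifyMatchAndSaveSignatureDifferences(oldDF, newDF, "id")
if (mismatches > 0) {
  // the differing rows were written to the Hive table "signature_table"
  spark.table("signature_table").show(false)
}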

Hope this helps.

Using Spark's different join types seems to be the key to computing deletions, additions, and updates on rows.

This question illustrates the different types of joins depending on what you are trying to achieve: What are the various join types in Spark?

Let's say we have two DataFrames, z1 and z2. Option 1 is good for rows without duplicates. You can try these in spark-shell.

  • Option 1: do except directly

val inZ1NotInZ2 = z1.except(z2).toDF()
val inZ2NotInZ1 = z2.except(z1).toDF()

inZ1NotInZ2.show
inZ2NotInZ1.show
  • Option 2: use groupBy (for DataFrames with duplicate rows)
val z1Grouped = z1.groupBy(z1.columns.map(c => z1(c)).toSeq : _*).count().withColumnRenamed("count", "recordRepeatCount")
val z2Grouped = z2.groupBy(z2.columns.map(c => z2(c)).toSeq : _*).count().withColumnRenamed("count", "recordRepeatCount")

val inZ1NotInZ2 = z1Grouped.except(z2Grouped).toDF()
val inZ2NotInZ1 = z2Grouped.except(z1Grouped).toDF()

  • Option 3: use exceptAll, which should also work for data with duplicate rows
// Source Code: https://github.com/apache/spark/blob/50538600ec/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2029
val inZ1NotInZ2 = z1.exceptAll(z2).toDF()
val inZ2NotInZ1 = z2.exceptAll(z1).toDF()

Spark version: 2.2.0

Use both except and left anti join.

df2.except(df1) will be like:

+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 3|prod 4 |9/18/2017|230 |431|169    |
|city 1|prod 4 |9/27/2017|350 |90 |190    |
|city 1|prod 3 |9/9/2017 |230 |430|160    |
+------+-------+---------+----+---+-------+

Just as koiralo said, but the deleted record 'city 2 prod 1' is lost, so we need a left anti join (or a left join with filters):

select * from df1 left anti join df2 on df1.city=df2.city and df1.product=df2.product

Then union the results of df2.except(df1) and the left anti join.
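
Since the question asks for a solution without Spark SQL, the same idea can be sketched with the DataFrame API (assuming the comparison keys city, product and date):

// rows of df1 whose key is missing from df2 -> deleted records
val deleted = df1.join(df2, Seq("city", "product", "date"), "left_anti")

// rows added or changed in df2, as returned by except
val newOrChanged = df2.except(df1)

// all differences between the two snapshots (columns line up because
// a left anti join keeps only df1's columns, in their original order)
val allDiffs = newOrChanged.union(deleted)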

But I didn't test the performance of the left anti join on a large dataset.

PS: If your Spark version is above 2.4, using spark-extension will be easier.
