Check matching data in DataFrames in Spark

How can I check whether two DataFrames generated the way I did contain the same data, including the number of rows? I'm using ScalaTest to run the tests, with Spark 3 and Scala 2.12.12. Below is my code, starting with the schemas of my two DataFrames (including the expected one), followed by the creation of all three DataFrames with data.

-- CREATING SCHEMAS FOR THE AMOUNTS AND WAREHOUSE DATAFRAMES AND THE EXPECTED FINAL SCHEMA

  val amountsSchema: StructType = StructType(
    Seq(
      StructField("positionId", LongType, nullable = true),
      StructField("amount", DecimalType(10, 2), nullable = true),
      StructField("eventTime", LongType, nullable = true),
    )
  )

  val warehouseSchema: StructType = StructType(
    Seq(
      StructField("positionId", LongType, nullable = true),
      StructField("warehouse", StringType, nullable = true),
      StructField("product", StringType, nullable = true),
      StructField("eventTime", LongType, nullable = true),
    )
  )

  val expectedDfSchema: StructType = StructType(
    Seq(
      StructField("positionId", LongType, nullable = true),
      StructField("warehouse", StringType, nullable = true),
      StructField("product", StringType, nullable = true),
      StructField("amount", DecimalType(10, 2), nullable = true),
    )
  )

--- CREATING DATA FOR THE AMOUNTS, WAREHOUSE AND EXPECTED FINAL DATAFRAMES

  // Values must match the schema's external types (Long, BigDecimal, Long);
  // string values against a LongType/DecimalType schema fail at runtime.
  val amounts_data = Seq(
    Row(1L, BigDecimal("5.00"), 1528463387L),
    Row(1L, BigDecimal("7.20"), 1528463005L),
    Row(2L, BigDecimal("5.00"), 1528463097L),
    Row(2L, BigDecimal("7.20"), 1528463007L),
    Row(3L, BigDecimal("6.00"), 1528463078L),
    Row(4L, BigDecimal("24.20"), 1528463008L),
    Row(4L, BigDecimal("15.00"), 1528463100L)
  )


  val wh_data = Seq(
    Row(1L, "W-1", "P-1", 1528463098L),
    Row(2L, "W-2", "P-2", 1528463097L),
    Row(3L, "W-2", "P-3", 1528463078L),
    Row(4L, "W-1", "P-6", 1528463100L)
  )

  val expected_data = Seq(
    Row(1L, "W-1", "P-1", BigDecimal("5.00")),
    Row(2L, "W-2", "P-2", BigDecimal("5.00")),
    Row(3L, "W-2", "P-3", BigDecimal("6.00")),
    Row(4L, "W-1", "P-6", BigDecimal("15.00"))
  )

---- CREATING DATAFRAMES FROM THE SCHEMAS AND DATA: DF_AMOUNTS, DF_WH AND DF_EXPECTED

  val df_amounts: DataFrame = spark.createDataFrame(
    spark.sparkContext.parallelize(amounts_data),
    amountsSchema
  )

  val df_wh: DataFrame = spark.createDataFrame(
    spark.sparkContext.parallelize(wh_data),
    warehouseSchema
  )

  val df_expected: DataFrame = spark.createDataFrame(
    spark.sparkContext.parallelize(expected_data),
    expectedDfSchema
  )

---- USING THE get_amounts FUNCTION TO GENERATE A DATAFRAME FROM DF_AMOUNTS AND DF_WH



  val resDf: DataFrame = get_amounts(df_amounts, df_wh)



---- TESTING IF THE resDf SCHEMA MATCHES THE EXPECTED SCHEMA - IT DOES, TEST PASSED

  test("DataFrame Schema Test") {
    assert(assertSchema(resDf.schema, df_expected.schema))
  }

---- TESTING IF THE resDf DATA MATCHES THE EXPECTED DATA - IT DOESN'T MATCH
  test("DataFrame Data Test") {
    assert(assertData(resDf, df_expected))
  }
}

The assertData function below is used to match the expected DataFrame against the one returned by my get_amounts function, but the test fails.

  def assertData(df1: DataFrame, df2: DataFrame): Boolean = {
    df1.exceptAll(df2).rdd.isEmpty()
  }

Thank you.

The way you create the datasets is valid, and the test structure looks good as well. I would suggest improving your assert method so you can see why the test case fails. Here are some thoughts on your testing method:

  • exceptAll is not perfect for testing: if df2 contains an additional row, it will still report that the data matches. Consider the code below:
  val df1 = Seq(
    (1, "x"),
    (2, "y")
  ).toDF("x", "y")
  
  val df2 = Seq(
    (1, "x"),
    (2, "y"),
    (3, "z")
  ).toDF("x", "y")
  
  assert(df1.exceptAll(df2).rdd.isEmpty())
  • "this function resolves columns by position (not by name)" (from the Spark Scala docs); because of this, test results can sometimes be confusing.

  • your assert method says nothing about what exactly mismatched.
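One way to address the first point is to run exceptAll in both directions; a minimal sketch (it assumes both DataFrames have the same schema, and the helper name is mine):

```scala
import org.apache.spark.sql.DataFrame

// Sketch: symmetric comparison. Rows missing from either side are caught,
// and exceptAll preserves duplicate counts, so row multiplicity is checked too.
def assertDataSymmetric(df1: DataFrame, df2: DataFrame): Boolean =
  df1.exceptAll(df2).isEmpty && df2.exceptAll(df1).isEmpty
```

With the df1/df2 example above, this variant returns false because the extra row (3, "z") shows up in df2.exceptAll(df1).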

For testing purposes it is not bad to collect a (small amount of) data and match sequences. You can consider using a method like this one:

  def matchDF(resultDF: DataFrame, expectedDF: DataFrame): Unit = {
    resultDF.printSchema()
    expectedDF.printSchema()
    assert(resultDF.schema == expectedDF.schema, 
        s"Schema does not match: ${resultDF.schema} != ${expectedDF.schema}")
    val expected = expectedDF.collect().toSeq
    val result = resultDF.collect().toSeq
    assert(expected == result, s"Data does not match: $result != $expected")
  }
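Since collect() gives no ordering guarantee, the comparison above can fail on identical data that merely arrives in a different order. Sorting both sides on all columns first makes it order-independent; a sketch (it assumes both DataFrames share the same column set, and the helper name is mine):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: sort both DataFrames on every column before collecting,
// so nondeterministic row order cannot cause a spurious mismatch.
def matchDFSorted(resultDF: DataFrame, expectedDF: DataFrame): Unit = {
  val sortCols = resultDF.columns.map(col).toSeq
  val result   = resultDF.sort(sortCols: _*).collect().toSeq
  val expected = expectedDF.sort(sortCols: _*).collect().toSeq
  assert(expected == result, s"Data does not match: $result != $expected")
}
```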

It's not a perfect approach (it still depends on the position of values within a row), but at least you will be able to find out what is going on and why your test fails.

For wrong data you'll see this:

assertion failed: Data does not match: WrappedArray([1,x], [2,y]) != WrappedArray([1,x], [3,y])

For a wrong schema you'll get:

root
 |-- x: integer (nullable = false)
 |-- y: string (nullable = true)

root
 |-- x: string (nullable = true)
 |-- y: string (nullable = true)

Exception in thread "main" java.lang.AssertionError: assertion failed: Schema does not match

I hope this helps you understand what is going wrong.
