
Check matching data in DataFrames in Spark

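How can I check whether two RDDs, generated the way I do it below, contain the same data (including the number of rows)? I'm using ScalaTest to run the tests, with Spark version 3 and Scala 2.12.12. Below is the code that creates the schemas for my two RDDs and for the expected one, and the data for all 3 RDDs.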

-- CREATING SCHEMA FOR RDD AMOUNTS AND WAREHOUSE AND EXPECTED FINAL SCHEMA

  val amountsSchema: StructType = StructType(
    Seq(
      StructField("positionId", LongType, nullable = true),
      StructField("amount", DecimalType(10, 2), nullable = true),
      StructField("eventTime",LongType, nullable = true),
    )
  )

  val warehouseSchema: StructType = StructType(
    Seq(
      StructField("positionId", LongType, nullable = true),
      StructField("warehouse", StringType, nullable = true),
      StructField("product", StringType, nullable = true),
      StructField("eventTime",LongType, nullable = true),
    )
  )

  val expectedDfSchema: StructType = StructType(
    Seq(
      StructField("positionId", LongType, nullable = true),
      StructField("warehouse", StringType, nullable = true),
      StructField("product", StringType, nullable = true),
      StructField("amount", DecimalType(10, 2), nullable = true),
    )
  )

--- CREATING DATA FOR RDD AMOUNTS RDD AND WAREHOUSE RDD AND EXPECTED FINAL RDD

  val amounts_data = Seq(
    Row("1", "5.00", "1528463387"),
    Row("1", "7.20", "1528463005"),
    Row("2", "5.00", "1528463097"),
    Row("2", "7.20", "1528463007"),
    Row("3", "6.00", "1528463078"),
    Row("4", "24.20", "1528463008"),
    Row("4", "15.00", "1528463100"),
  )


  val wh_data = Seq(
    Row("1", "W-1", "P-1", "1528463098"),
    Row("2", "W-2", "P-2", "1528463097"),
    Row("3", "W-2", "P-3", "1528463078"),
    Row("4", "W-1", "P-6", "1528463100"),
  )

  val expected_data = Seq(
    Row("1", "W-1", "P-1", "5.00"),
    Row("2", "W-2", "P-2", "5.00"),
    Row("3", "W-2", "P-3", "6.00"),
    Row("4", "W-1", "P-6", "15.00")
  )

---- CREATING RDD WITH SCHEMAS AND DATA FOR DF_AMOUNTS AND DF_WAREHOUSE AND FOR THE EXPECTED RDD WITH EXPECTED_DATA

  val df_amounts: DataFrame = spark.createDataFrame(
    spark.sparkContext.parallelize(amounts_data),
    amountsSchema
  )

  val df_wh: DataFrame = spark.createDataFrame(
    spark.sparkContext.parallelize(wh_data),
    warehouseSchema
  )

  val df_expected: DataFrame = spark.createDataFrame(
    spark.sparkContext.parallelize(expected_data),
    expectedDfSchema
  )

---- USING GET_AMOUNTS METHOD TO GENERATE A RDD FROM THE FUNCTION get_amounts



  val resDf: DataFrame = get_amounts(df_amounts, df_wh)



---- TESTING IF THE resDf SCHEMA MATCHES THE EXPECTED SCHEMA - IT DOES, TEST PASSED

  test("DataFrame Schema Test") {
    assert(assertSchema(resDf.schema, df_expected.schema))
  }

---- TESTING IF THE resDf DATA MATCHES THE EXPECTED DATA - IT DOESN'T MATCH
  test("DataFrame Data Test") {
    assert(assertData(resDf, df_expected))
  }
}

The assertData function is used to match the data of the expected RDD against the data returned by my get_amounts function, but the test fails.

def assertData(df1: DataFrame, df2: DataFrame): Boolean = {
    df1.exceptAll(df2).rdd.isEmpty()
  }

Thanks

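The way you create the datasets is valid, and the test structure looks fine too. I'd suggest improving your assertion method so that you can see why the test case fails. Here are some thoughts on your testing method: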

  • exceptAll is not a perfect choice for tests: if df2 contains extra rows, it will still report that the data matches. Consider the following code:
  val df1 = Seq(
    (1, "x"),
    (2, "y")
  ).toDF("x", "y")
  
  val df2 = Seq(
    (1, "x"),
    (2, "y"),
    (3, "z")
  ).toDF("x", "y")
  
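  // Passes even though df2 has an extra row (3, "z"):
  // exceptAll only returns rows of df1 that are missing from df2.
  assert(df1.exceptAll(df2).rdd.isEmpty())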
  • "This function resolves columns by position (not by name)" (from the Spark Scaladoc), so the test results can sometimes be confusing.

  • Your assertion method doesn't say what exactly failed to match.
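
The first point can be addressed by checking the difference in both directions. Below is a minimal sketch, assuming Spark 2.4 or later (for `exceptAll` and `Dataset.isEmpty`). The names `SymmetricDiff` and `sameData` and the local `SparkSession` are illustrative and not part of the original code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SymmetricDiff {
  // True only if both DataFrames contain the same rows with the same
  // multiplicities. Columns are still matched by position, not by name.
  def sameData(df1: DataFrame, df2: DataFrame): Boolean =
    df1.exceptAll(df2).isEmpty && df2.exceptAll(df1).isEmpty

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a test suite would reuse its own.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("symmetric-diff")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "x"), (2, "y")).toDF("x", "y")
    val df2 = Seq((1, "x"), (2, "y"), (3, "z")).toDF("x", "y")

    assert(df1.exceptAll(df2).isEmpty) // one-directional check misses the extra row
    assert(!sameData(df1, df2))        // two-directional check catches it
    assert(sameData(df1, df1))

    spark.stop()
  }
}
```

Because `exceptAll` keeps duplicates, this also covers the row-count requirement: two DataFrames with the same distinct rows but different multiplicities are reported as different.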

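For testing purposes, collecting a (small) amount of data and comparing the resulting sequences is also fine. You could consider a method like this: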

  def matchDF(resultDF: DataFrame, expectedDF: DataFrame): Unit = {
    resultDF.printSchema()
    expectedDF.printSchema()
    assert(resultDF.schema == expectedDF.schema, 
        s"Schema does not match: ${resultDF.schema} != ${expectedDF.schema}")
    val expected = expectedDF.collect().toSeq
    val result = resultDF.collect().toSeq
    assert(expected == result, s"Data does not match: $result != $expected")
  }

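This is not a perfect approach (it still depends on position within a row), but at least you will be able to find out what is going on and why the test fails.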

For wrong data, you will see:

assertion failed: Data does not match: WrappedArray([1,x], [2,y]) != WrappedArray([1,x], [3,y])

For a wrong schema, you will get:

root
 |-- x: integer (nullable = false)
 |-- y: string (nullable = true)

root
 |-- x: string (nullable = true)
 |-- y: string (nullable = true)

Exception in thread "main" java.lang.AssertionError: assertion failed: Schema does not match
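
The collect-and-compare helper is sensitive to column position, and also to the order of the collected rows, which Spark does not guarantee without an explicit sort. A possible variant that ignores both is sketched below. It assumes both DataFrames use the same column names and compatible column types; `UnorderedMatch` and `matchIgnoringOrder` are hypothetical names, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object UnorderedMatch {
  def matchIgnoringOrder(resultDF: DataFrame, expectedDF: DataFrame): Unit = {
    assert(resultDF.columns.toSet == expectedDF.columns.toSet,
      s"Columns do not match: ${resultDF.columns.mkString(",")} != ${expectedDF.columns.mkString(",")}")

    // Reorder the result's columns by name so the positional comparison is safe.
    val aligned = resultDF.select(expectedDF.columns.map(col): _*)

    // Rows present on one side only, with duplicates preserved.
    val missing = expectedDF.exceptAll(aligned).collect().toSeq
    val unexpected = aligned.exceptAll(expectedDF).collect().toSeq

    assert(missing.isEmpty && unexpected.isEmpty,
      s"Data does not match. Missing rows: $missing; unexpected rows: $unexpected")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("unordered-match")
      .getOrCreate()
    import spark.implicits._

    val expected = Seq((1, "x"), (2, "y")).toDF("x", "y")
    // Same data, but with rows and columns in a different order.
    val result = Seq(("y", 2), ("x", 1)).toDF("y", "x")

    matchIgnoringOrder(result, expected) // does not throw

    spark.stop()
  }
}
```

The failure message lists only the rows that differ, which tends to be easier to read than two full sequences when the DataFrames grow beyond a handful of rows.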

I hope this helps you understand what went wrong.
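
One further thing that may be worth checking, separate from the assertion helpers: the `Row`s in the question hold string literals (`"1"`, `"5.00"`), while the schemas declare `LongType` and `DecimalType(10, 2)`. `createDataFrame(rdd, schema)` does not convert values, and such a mismatch typically surfaces only when an action runs, so a schema-only test can still pass. Below is a sketch of the expected data with values typed to match the schema. The object name `TypedRows` and the helper `expectedDf` are illustrative assumptions.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._

object TypedRows {
  val expectedDfSchema: StructType = StructType(Seq(
    StructField("positionId", LongType, nullable = true),
    StructField("warehouse", StringType, nullable = true),
    StructField("product", StringType, nullable = true),
    StructField("amount", DecimalType(10, 2), nullable = true)
  ))

  // Long literals for LongType, java.math.BigDecimal for DecimalType.
  val expectedData: Seq[Row] = Seq(
    Row(1L, "W-1", "P-1", new java.math.BigDecimal("5.00")),
    Row(2L, "W-2", "P-2", new java.math.BigDecimal("5.00")),
    Row(3L, "W-2", "P-3", new java.math.BigDecimal("6.00")),
    Row(4L, "W-1", "P-6", new java.math.BigDecimal("15.00"))
  )

  // Builds the expected DataFrame the same way as in the question.
  def expectedDf(spark: SparkSession): DataFrame =
    spark.createDataFrame(
      spark.sparkContext.parallelize(expectedData),
      expectedDfSchema
    )
}
```

The same idea would apply to `amounts_data` and `wh_data`, whose `positionId` and `eventTime` columns are declared as `LongType`.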

