Check matching data in DataFrames in Spark
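How can I check whether two RDDs, generated the way I do below, contain the same data (including the number of rows)? I'm running the tests with ScalaTest, on Spark 3 with Scala 2.12.12. Below is my code: the schemas for my two RDDs plus the expected one, and the data for all three.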
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types._

// Schemas for the amounts and warehouse data, and the expected final schema
val amountsSchema: StructType = StructType(
  Seq(
    StructField("positionId", LongType, nullable = true),
    StructField("amount", DecimalType(10, 2), nullable = true),
    StructField("eventTime", LongType, nullable = true),
  )
)
val warehouseSchema: StructType = StructType(
  Seq(
    StructField("positionId", LongType, nullable = true),
    StructField("warehouse", StringType, nullable = true),
    StructField("product", StringType, nullable = true),
    StructField("eventTime", LongType, nullable = true),
  )
)
val expectedDfSchema: StructType = StructType(
  Seq(
    StructField("positionId", LongType, nullable = true),
    StructField("warehouse", StringType, nullable = true),
    StructField("product", StringType, nullable = true),
    StructField("amount", DecimalType(10, 2), nullable = true),
  )
)
// Data for the amounts RDD, the warehouse RDD and the expected final RDD
val amounts_data = Seq(
  Row("1", "5.00", "1528463387"),
  Row("1", "7.20", "1528463005"),
  Row("2", "5.00", "1528463097"),
  Row("2", "7.20", "1528463007"),
  Row("3", "6.00", "1528463078"),
  Row("4", "24.20", "1528463008"),
  Row("4", "15.00", "1528463100"),
)
val wh_data = Seq(
  Row("1", "W-1", "P-1", "1528463098"),
  Row("2", "W-2", "P-2", "1528463097"),
  Row("3", "W-2", "P-3", "1528463078"),
  Row("4", "W-1", "P-6", "1528463100"),
)
val expected_data = Seq(
  Row("1", "W-1", "P-1", "5.00"),
  Row("2", "W-2", "P-2", "5.00"),
  Row("3", "W-2", "P-3", "6.00"),
  Row("4", "W-1", "P-6", "15.00")
)
// DataFrames built from the schemas and data above: df_amounts, df_wh and the expected result
val df_amounts: DataFrame = spark.createDataFrame(
  spark.sparkContext.parallelize(amounts_data),
  amountsSchema
)
val df_wh: DataFrame = spark.createDataFrame(
  spark.sparkContext.parallelize(wh_data),
  warehouseSchema
)
val df_expected: DataFrame = spark.createDataFrame(
  spark.sparkContext.parallelize(expected_data),
  expectedDfSchema
)
// Result produced by the get_amounts function under test
val resDf: DataFrame = get_amounts(df_amounts, df_wh)
// Does the resDf schema match the expected schema? It does, this test passes
test("DataFrame Schema Test") {
  assert(assertSchema(resDf.schema, df_expected.schema))
}
// Does the resDf data match the expected data? It does not, this test fails
test("DataFrame Data Test") {
  assert(assertData(resDf, df_expected))
}
}
The assertData function below compares the data of the expected RDD with the data returned by my get_amounts function, but the test fails.
def assertData(df1: DataFrame, df2: DataFrame): Boolean = {
  df1.exceptAll(df2).rdd.isEmpty()
}
Thanks
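The way you create the datasets is valid, and the test structure looks fine too. I would suggest improving your assert method so that you can see why the test case fails. Here are some thoughts on your testing method: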
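exceptAll is not a perfect choice for testing: if df2 contains an additional row, it will still report that the data matches. Consider the following code:
val df1 = Seq(
  (1, "x"),
  (2, "y")
).toDF("x", "y")
val df2 = Seq(
  (1, "x"),
  (2, "y"),
  (3, "z")
).toDF("x", "y")
// Passes even though df2 has the extra row (3, "z")
assert(df1.exceptAll(df2).rdd.isEmpty())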
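"This function resolves columns by position (not by name)" (from the Scaladoc in the Spark source), so the test results can sometimes be confusing.
Your assert method does not say what exactly mismatched.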
For testing purposes it is not bad to collect a (small) amount of data and compare the sequences. You could consider a method like this:
def matchDF(resultDF: DataFrame, expectedDF: DataFrame): Unit = {
  resultDF.printSchema()
  expectedDF.printSchema()
  assert(resultDF.schema == expectedDF.schema,
    s"Schema does not match: ${resultDF.schema} != ${expectedDF.schema}")
  val expected = expectedDF.collect().toSeq
  val result = resultDF.collect().toSeq
  assert(expected == result, s"Data does not match: $result != $expected")
}
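This is not a perfect method (it still depends on the position of rows in the collected sequence), but at least you will be able to find out what happened and why the test failed.
For wrong data you will see: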
assertion failed: Data does not match: WrappedArray([1,x], [2,y]) != WrappedArray([1,x], [3,y])
For a wrong schema you will get:
root
|-- x: integer (nullable = false)
|-- y: string (nullable = true)
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
Exception in thread "main" java.lang.AssertionError: assertion failed: Schema does not match
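The dependence on row order in matchDF, and the one-sided nature of a single exceptAll, can both be avoided by running exceptAll in both directions and printing whatever is left over. The following is only a minimal sketch: DataFrameDiff and assertSameRows are hypothetical names (not part of Spark or ScalaTest), and it assumes both DataFrames have the same column names with compatible types.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameDiff {
  // Hypothetical helper, not a Spark API: fails with the differing rows in the message.
  def assertSameRows(actual: DataFrame, expected: DataFrame): Unit = {
    // exceptAll resolves columns by position, so put `expected` in the column order of `actual` first.
    val aligned = expected.select(actual.columns.map(c => expected.col(c)): _*)
    // Rows (duplicates counted) that are expected but missing from the result.
    val missing = aligned.exceptAll(actual).collect().toSeq
    // Rows (duplicates counted) that are in the result but were not expected.
    val unexpected = actual.exceptAll(aligned).collect().toSeq
    assert(missing.isEmpty && unexpected.isEmpty,
      s"Missing rows: $missing; unexpected rows: $unexpected")
  }

  def main(args: Array[String]): Unit = {
    // Local session only for this illustration.
    val spark = SparkSession.builder().master("local[1]").appName("df-diff").getOrCreate()
    import spark.implicits._
    val df1 = Seq((1, "x"), (2, "y")).toDF("x", "y")
    val df2 = Seq((2, "y"), (1, "x")).toDF("x", "y")
    assertSameRows(df1, df2) // same rows in a different order
    spark.stop()
  }
}
```

Because both directions are checked, an extra row on either side makes the assertion fail, and the message lists which rows were missing or unexpected. Collecting the differences is only reasonable for small test data.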
I hope this helps you understand what is going wrong.
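Independent of the assert method, one thing that may be worth ruling out, assuming the rows in the question are exactly what the test runs: every value in the Row objects is a String, while the schemas declare LongType and DecimalType(10, 2). As far as I know, createDataFrame on an RDD of Row only checks values against the schema when an action runs, so a schema-only test can pass while any test that reads the data fails with an error along the lines of "java.lang.String is not a valid external type for schema of bigint". A minimal sketch of rows whose JVM types match the declared schema (TypedRows and buildAmounts are hypothetical names, and only the first two amounts rows are shown):

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._

object TypedRows {
  val amountsSchema: StructType = StructType(Seq(
    StructField("positionId", LongType, nullable = true),
    StructField("amount", DecimalType(10, 2), nullable = true),
    StructField("eventTime", LongType, nullable = true)
  ))

  // Long for LongType, BigDecimal for DecimalType: the values carry the types the schema declares.
  val amountsData: Seq[Row] = Seq(
    Row(1L, BigDecimal("5.00"), 1528463387L),
    Row(1L, BigDecimal("7.20"), 1528463005L)
  )

  // Hypothetical helper: builds the amounts DataFrame from the typed rows.
  def buildAmounts(spark: SparkSession): DataFrame =
    spark.createDataFrame(spark.sparkContext.parallelize(amountsData), amountsSchema)

  def main(args: Array[String]): Unit = {
    // Local session only for this illustration.
    val spark = SparkSession.builder().master("local[1]").appName("typed-rows").getOrCreate()
    buildAmounts(spark).show() // an action, so the values are actually encoded here
    spark.stop()
  }
}
```

If the real test data already uses typed values, this does not apply and the mismatch has to come from the data itself, which the improved assertion message should reveal.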