Running multiple SQL queries and testing for pass or fail in Spark Scala

I am running 100 queries (test cases) to check data quality in Spark Scala. I am querying data from a Hive table.

An empty DataFrame is the expected result for these sample queries:

SELECT car_type FROM car_data WHERE car_version is null
SELECT car_color FROM car_data WHERE car_date is null
SELECT car_sale FROM car_data WHERE car_timestamp is null

I want to write to a text file whether each test case passed or failed, based on the expected result. What is the best way to accomplish this?

What I have so far:

val test_1 = context.sql("SELECT car_type FROM car_data WHERE car_version is null")
val test_2 = context.sql("SELECT car_color FROM car_data WHERE car_date is null")
val test_3 = context.sql("SELECT car_sale FROM car_data WHERE car_timestamp is null")
test_1.head(1).isEmpty 

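If you want to know whether any values are NULL, you can use conditional aggregation: COUNT(*) counts every row, while COUNT(column) counts only the rows where that column is not NULL, so the two are equal exactly when the column has no NULLs. I would be inclined to run all the tests with one query: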

-- The counted columns are the ones tested for NULL in the question's WHERE clauses.
SELECT (CASE WHEN COUNT(*) = COUNT(car_version) THEN 'PASS' ELSE 'FAIL' END) as car_version_test,
       (CASE WHEN COUNT(*) = COUNT(car_date) THEN 'PASS' ELSE 'FAIL' END) as car_date_test,
       (CASE WHEN COUNT(*) = COUNT(car_timestamp) THEN 'PASS' ELSE 'FAIL' END) as car_timestamp_test
FROM car_data;
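
To get from that query to a text file, one option is to generate the query from a list of checks and write one line per result. The following is only a sketch in plain Scala: the names NullCheckReport, NullCheck, buildQuery, reportLines and writeReport are made up for illustration, and the Spark call is shown only in a comment because it depends on how your context is set up.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object NullCheckReport {
  // One check: a label for the report and the column that must contain no NULLs.
  // (Hypothetical type, not part of Spark.)
  final case class NullCheck(label: String, column: String)

  // Builds a single SELECT with one PASS/FAIL expression per check,
  // using the COUNT(*) = COUNT(column) conditional aggregation.
  def buildQuery(table: String, checks: Seq[NullCheck]): String = {
    val exprs = checks.map { c =>
      s"(CASE WHEN COUNT(*) = COUNT(${c.column}) THEN 'PASS' ELSE 'FAIL' END) AS ${c.label}"
    }
    s"SELECT ${exprs.mkString(",\n       ")}\nFROM $table"
  }

  // Pairs each check label with the PASS/FAIL value returned for it.
  def reportLines(checks: Seq[NullCheck], results: Seq[String]): Seq[String] =
    checks.zip(results).map { case (c, r) => s"${c.label}: $r" }

  // Writes the report lines to a text file on the driver's local file system.
  def writeReport(path: String, lines: Seq[String]): Unit = {
    Files.write(Paths.get(path), lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
    ()
  }

  // Assumed wiring, with `context` being the SQL/Hive context from the question
  // and `checks` listing whichever columns must not be NULL:
  //   val row     = context.sql(buildQuery("car_data", checks)).head()
  //   val results = checks.indices.map(i => row.getString(i))
  //   writeReport("dq_report.txt", reportLines(checks, results))
}
```

Because the aggregation returns a single row, only that one row is brought back to the driver, no matter how many checks are in the list.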

Note: This considers an empty table to pass the test, whereas your code would not. These can be easily modified to handle that case, but this behavior makes sense to me.
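
If an empty table should count as a failure instead, one possible modification is to require at least one row in addition to the NULL check. A minimal sketch, where StrictNullCheck and strictExpr are hypothetical names used only to generate the expression:

```scala
object StrictNullCheck {
  // PASS only when the table has at least one row and the column has no NULLs.
  // On an empty table COUNT(*) is 0, so the first condition fails and the result is FAIL.
  def strictExpr(column: String, alias: String): String =
    s"(CASE WHEN COUNT(*) > 0 AND COUNT(*) = COUNT($column) THEN 'PASS' ELSE 'FAIL' END) AS $alias"
}
```

Each generated expression can be used in the SELECT list in place of the corresponding CASE expression.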
