
Running multiple SQL queries and testing for pass or fail in Spark Scala

I am running 100 queries (test cases) to check data quality in Spark Scala. I am querying data from a Hive table.

An empty data frame is the expected result for these sample queries:

SELECT car_type FROM car_data WHERE car_version IS NULL
SELECT car_color FROM car_data WHERE car_date IS NULL
SELECT car_sale FROM car_data WHERE car_timestamp IS NULL

I want to write whether each test case passed or failed, based on the expected result, to a text file. What is the best way to accomplish this?

What I have so far:

val test_1 = context.sql("SELECT car_type FROM car_data WHERE car_version IS NULL")
val test_2 = context.sql("SELECT car_color FROM car_data WHERE car_date IS NULL")
val test_3 = context.sql("SELECT car_sale FROM car_data WHERE car_timestamp IS NULL")
test_1.head(1).isEmpty  // true when the query returned no rows, i.e. the test passed
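The per-query approach above can be extended to all test cases by keeping the queries in a named collection and writing one PASS/FAIL line per test. A minimal sketch, assuming a Hive-enabled SparkSession and the table from the question; the test names and the output path `quality_report.txt` are made up for illustration:

```scala
import java.io.PrintWriter
import org.apache.spark.sql.SparkSession

object DataQualityChecks {
  // An empty query result means the check passed
  def status(isEmpty: Boolean): String = if (isEmpty) "PASS" else "FAIL"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-quality-checks")
      .enableHiveSupport()
      .getOrCreate()

    // Name each test case so the report is readable
    val tests = Seq(
      "car_version_null"   -> "SELECT car_type FROM car_data WHERE car_version IS NULL",
      "car_date_null"      -> "SELECT car_color FROM car_data WHERE car_date IS NULL",
      "car_timestamp_null" -> "SELECT car_sale FROM car_data WHERE car_timestamp IS NULL"
    )

    // head(1).isEmpty checks emptiness without counting every row
    val report = tests.map { case (name, sql) =>
      s"$name: ${status(spark.sql(sql).head(1).isEmpty)}"
    }

    // Write the report to a local text file (path is an assumption)
    val out = new PrintWriter("quality_report.txt")
    try report.foreach(out.println) finally out.close()
  }
}
```

With 100 queries this still launches one Spark job per test, which is why a single aggregated query (as in the answer below) can be much cheaper.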

If you want to know if any values are NULL, you can use conditional aggregation. I would be inclined to run all the tests with one query:

SELECT (CASE WHEN COUNT(*) = COUNT(car_type) THEN 'PASS' ELSE 'FAIL' END) as car_type_test,
       (CASE WHEN COUNT(*) = COUNT(car_color) THEN 'PASS' ELSE 'FAIL' END) as car_color_test,
       (CASE WHEN COUNT(*) = COUNT(car_sale) THEN 'PASS' ELSE 'FAIL' END) as car_sale_test       
FROM car_data;

Note: This considers an empty table to pass the test, whereas your code would not. These can be easily modified to handle that case, but this behavior makes sense to me.
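In Spark, the combined query returns a single-row DataFrame with one column per test, which can be formatted and written out directly. A sketch under the same assumptions as the question (Hive-enabled session, `car_data` table); the output path and helper names are illustrative, not part of the original answer:

```scala
import java.io.PrintWriter
import org.apache.spark.sql.SparkSession

object CombinedCheck {
  // Turn column names and their PASS/FAIL values into report lines
  def formatReport(names: Seq[String], statuses: Seq[String]): Seq[String] =
    names.zip(statuses).map { case (name, status) => s"$name: $status" }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("combined-quality-check")
      .enableHiveSupport()
      .getOrCreate()

    // All three checks in one scan of the table
    val result = spark.sql(
      """SELECT (CASE WHEN COUNT(*) = COUNT(car_type)  THEN 'PASS' ELSE 'FAIL' END) AS car_type_test,
        |       (CASE WHEN COUNT(*) = COUNT(car_color) THEN 'PASS' ELSE 'FAIL' END) AS car_color_test,
        |       (CASE WHEN COUNT(*) = COUNT(car_sale)  THEN 'PASS' ELSE 'FAIL' END) AS car_sale_test
        |FROM car_data""".stripMargin)

    // Aggregation without GROUP BY always yields exactly one row
    val row = result.head()
    val statuses = result.columns.indices.map(row.getString)
    val lines = formatReport(result.columns.toSeq, statuses)

    // Write the report to a local text file (path is an assumption)
    val out = new PrintWriter("quality_report.txt")
    try lines.foreach(out.println) finally out.close()
  }
}
```

This reads the table once for all checks, instead of once per test case, because `COUNT(col)` skips NULLs while `COUNT(*)` does not.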

