
How to write Scala unit tests to compare Spark DataFrames?

Purpose: checking whether a DataFrame generated by Spark and a manually created DataFrame are the same.

Earlier implementation, which worked:

if (da.except(ds).count() != 0 && ds.except(da).count() != 0)

Boolean returned: true

Here da and ds are the generated DataFrame and the manually created DataFrame, respectively.

Here I am running the program via the spark-shell.

Newer implementation, which doesn't work:

assert(da.except(ds).count() != 0 && ds.except(da).count() != 0)

Boolean returned: false

Again, da and ds are the generated DataFrame and the manually created DataFrame, respectively.

Here I am using ScalaTest's assert method instead, but the assertion does not evaluate to true.

Why switch to the new implementation when the previous method worked? So that sbt runs the test file with ScalaTest, either via sbt test or during compilation.

The same DataFrame-comparison code gives the correct output when run in the spark-shell, but fails when run with ScalaTest under sbt.

The two programs are effectively the same, yet the results differ. What could be the problem?
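For context, a minimal sketch of what such a test can look like under sbt, assuming ScalaTest 3.x and a local SparkSession; the class, app, and column names here are placeholders. Note that except returns the rows present in one DataFrame but missing from the other, so for two equal DataFrames both counts are 0; an equality assertion therefore compares against == 0, while != 0 only holds when the frames differ.

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class DataFrameCompareSpec extends AnyFunSuite {

  // Unlike the spark-shell, sbt provides no ready-made `spark` value,
  // so the test must create its own local session.
  private val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("dataframe-compare-test")
    .getOrCreate()

  import spark.implicits._

  test("generated DataFrame matches the manually created one") {
    val da = Seq((1, "a"), (2, "b")).toDF("id", "value") // stand-in for the generated DataFrame
    val ds = Seq((1, "a"), (2, "b")).toDF("id", "value") // stand-in for the expected DataFrame

    // except ignores duplicate multiplicity, so both counts being 0 means
    // the two frames contain the same distinct rows.
    assert(da.except(ds).count() == 0 && ds.except(da).count() == 0)
  }
}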

Tests that compare DataFrames already exist in Spark's own test suites (under sql/core), for example: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala

The shared test code these suites use (SharedSQLContext, etc.) is published to the central Maven repository; you can include those artifacts in your project and use the checkAnswer methods to compare DataFrames, as in the sketch below.
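A sketch of the sbt wiring this needs, assuming Spark 3.x; the version numbers are assumptions and must match your build. Spark publishes its test jars under the tests classifier, and those jars carry QueryTest.checkAnswer and the shared-session helpers (SharedSQLContext in older releases, SharedSparkSession in newer ones):

// build.sbt (versions are assumptions; match them to your Spark/Scala build)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.5.1",
  "org.apache.spark" %% "spark-core" % "3.5.1" % Test classifier "tests",
  "org.apache.spark" %% "spark-catalyst" % "3.5.1" % Test classifier "tests",
  "org.apache.spark" %% "spark-sql" % "3.5.1" % Test classifier "tests"
)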

I solved the issue by using https://github.com/MrPowers/spark-fast-tests as a dependency.
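A sketch of how that library is typically used, assuming its DataFrameComparer trait and ScalaTest 3.x; check the project README for the dependency coordinates matching your Spark version:

import com.github.mrpowers.spark.fast.tests.DataFrameComparer
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class GeneratedDataFrameSpec extends AnyFunSuite with DataFrameComparer {

  private val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("spark-fast-tests-example")
    .getOrCreate()

  import spark.implicits._

  test("da equals ds") {
    val da = Seq((1, "a")).toDF("id", "value")
    val ds = Seq((1, "a")).toDF("id", "value")
    // On mismatch this fails with a row-level diff rather than a bare
    // Boolean, which makes failures easier to debug than except().count().
    assertSmallDataFrameEquality(da, ds)
  }
}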

Another solution would be to iterate over the rows of the two DataFrames individually and compare them, as sketched below.
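A minimal sketch of that approach, assuming both frames are small enough to collect to the driver; sameRows is a hypothetical helper name, and the sort makes row order deterministic before comparing:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Collect both DataFrames and compare them row by row. Only suitable for
// small frames, since collect() brings every row to the driver.
def sameRows(da: DataFrame, ds: DataFrame): Boolean = {
  val left = da.sort(da.columns.map(col): _*).collect()
  val right = ds.sort(ds.columns.map(col): _*).collect()
  left.sameElements(right)
}

This compares values only; a fuller check would also compare da.schema against ds.schema.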
