How to write Scala unit tests to compare Spark DataFrames?
Purpose: checking whether a DataFrame generated by Spark and a manually created DataFrame are the same.
Earlier implementation, which worked:
if (da.except(ds).count() != 0 && ds.except(da).count() != 0)
Boolean returned: true
Here da and ds are the generated DataFrame and the manually created DataFrame, respectively.
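Conceptually, Spark's except returns the rows present in one DataFrame but absent from the other, and it is set-based (duplicates and row order are ignored). The two-sided check above can be sketched with plain Scala collections standing in for the DataFrames; the sample data here is hypothetical:

```scala
// Plain-Scala sketch of the two-sided `except` check.
// Spark's except is set-based: duplicates and row order are ignored,
// so Sets of rows model its behaviour closely enough for illustration.
val da = Seq(("a", 1), ("b", 2), ("b", 2)) // hypothetical "generated" rows
val ds = Seq(("b", 2), ("a", 1))           // hypothetical "expected" rows

// rows in `left` that do not appear in `right`, like left.except(right).count()
def exceptCount[T](left: Seq[T], right: Seq[T]): Long =
  (left.toSet -- right.toSet).size.toLong

// Both directions empty => the two sides contain the same set of rows.
val same = exceptCount(da, ds) == 0 && exceptCount(ds, da) == 0
println(same) // true: same rows, the duplicate ("b", 2) is ignored
```

Note that because the comparison is set-based, it cannot detect differences in how many times a row occurs.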
Here I am running the program via the spark-shell.
Newer implementation, which doesn't work:
assert(da.except(ds).count() != 0 && ds.except(da).count() != 0)
Boolean returned: false
Here da and ds are the generated DataFrame and the manually created DataFrame, respectively.
Here I am using ScalaTest's assert method instead, but the result does not come back as true.
Why try the new implementation when the previous method worked? So that sbt can use ScalaTest to run the test file automatically, via sbt test or during compilation.
The same code for comparing Spark DataFrames gives the correct output when run via the spark-shell, but shows an error when run with ScalaTest under sbt.
The two programs are effectively the same, but the results are different. What could be the problem?
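One detail worth noting about the snippet above: assert throws unless its predicate is true, and the predicate as written asserts that both except counts are non-zero, i.e. that the DataFrames differ. A test that the DataFrames are equal would assert that both counts are zero. A sketch, with hypothetical Sets standing in for the DataFrames' rows:

```scala
// assert (ScalaTest's, like Predef.assert) succeeds only when the
// predicate is true. For an equality test, both "except" directions
// must be empty (count == 0), not non-empty (count != 0).
val daRows = Set(("a", 1), ("b", 2)) // hypothetical rows of da
val dsRows = Set(("b", 2), ("a", 1)) // hypothetical rows of ds

// equality check: nothing in da that's missing from ds, and vice versa
val equalRows = (daRows -- dsRows).isEmpty && (dsRows -- daRows).isEmpty
assert(equalRows)
println(equalRows) // true
```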
Tests for comparing DataFrames exist in Spark Core, for example: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala
Libraries with the shared test code (SharedSQLContext, etc.) are available in the central Maven repository; you can include them in your project and use the "checkAnswer" methods to compare DataFrames.
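At its core, a checkAnswer-style comparison collects the rows of both sides and compares them order-insensitively. A simplified, Spark-free analogue of that idea (the row encoding and sample data here are illustrative, not Spark's actual implementation):

```scala
// Simplified analogue of an order-insensitive row comparison, roughly
// what checkAnswer-style helpers do after collecting rows to the driver.
// Rows are modelled as Seq[Any]; real code would compare Row objects.
def sameRows(left: Seq[Seq[Any]], right: Seq[Seq[Any]]): Boolean =
  left.map(_.mkString("|")).sorted == right.map(_.mkString("|")).sorted

val expected = Seq(Seq("a", 1), Seq("b", 2))
val actual   = Seq(Seq("b", 2), Seq("a", 1)) // same rows, different order
println(sameRows(actual, expected)) // true
```

Unlike the except-based approach, comparing the full collected sequences also catches differences in duplicate counts.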
I solved the issue by using spark-fast-tests as a dependency: https://github.com/MrPowers/spark-fast-tests
Another solution would be to iterate over the rows of the DataFrames individually and compare them.
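That row-by-row approach can be sketched as follows, with tuples standing in for collected Row objects (the data is hypothetical; in real code you would call collect() on each DataFrame first, which is only sensible for small DataFrames since it pulls everything to the driver):

```scala
// Sketch of a row-by-row comparison after collecting both DataFrames.
// Tuples stand in for collected Row objects here.
val collectedDa = List(("a", 1), ("b", 2))
val collectedDs = List(("a", 1), ("b", 2))

// Same length, and every aligned pair of rows is equal.
val framesEqual =
  collectedDa.length == collectedDs.length &&
  collectedDa.zip(collectedDs).forall { case (l, r) => l == r }

println(framesEqual) // true
```

Note this variant is order-sensitive, so the rows must be sorted the same way on both sides before comparing.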