
How to write Scala unit tests to compare Spark dataframes?

Purpose - Checking whether a dataframe generated by Spark and a manually created dataframe are the same.

Earlier implementation, which worked:

if (da.except(ds).count() != 0 && ds.except(da).count != 0)

Boolean returned: true

Where da and ds are the generated dataframe and the created dataframe, respectively.

Here I am running the program via the spark-shell.

Newer implementation, which doesn't work:

assert (da.except(ds).count() != 0 && ds.except(da).count != 0)

Boolean returned: false

Where da and ds are the generated dataframe and the created dataframe, respectively.

Here I am using the assert method of ScalaTest instead, but the result does not come back as true.
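One thing worth noting about the condition itself (my reading, not part of the original question): except returns the rows present in one dataframe but not the other, so the condition as written is true only when the dataframes *differ*. If da and ds really are identical, both except counts are 0 and the assertion fails. A minimal sketch of the inverted check, with a hypothetical helper name:

```scala
import org.apache.spark.sql.DataFrame

// Returns true when the two dataframes contain the same rows.
// Note: except behaves like a set difference (EXCEPT DISTINCT),
// so this ignores row order and duplicate counts.
def sameRows(da: DataFrame, ds: DataFrame): Boolean =
  da.except(ds).count() == 0 && ds.except(da).count() == 0

// In a ScalaTest suite the assertion would then be:
// assert(sameRows(da, ds))
```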

Why try the new implementation when the previous method worked? To have sbt run the test file with ScalaTest, either via sbt test or at compile time.

The same code to compare Spark dataframes gives the correct output when run via the spark-shell, but shows an error when run with ScalaTest under sbt.

The two programs are effectively the same, but the results differ. What could be the problem?

Tests that compare dataframes exist in Spark Core, for example: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala

Libraries with the shared test code (SharedSQLContext, etc.) are available in the central Maven repository; you can include them in your project and use the checkAnswer methods to compare dataframes.
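A sketch of what that could look like, assuming the spark-sql test artifact (the jar with the tests classifier) is on the test classpath; the class and method names here follow older Spark releases and may differ in newer ones:

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSQLContext

class DataFrameCompareSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  test("generated dataframe matches the expected rows") {
    val da = Seq((1, "a"), (2, "b")).toDF("id", "value")
    // checkAnswer compares contents ignoring row order and fails
    // the test with a readable diff when they differ.
    checkAnswer(da, Seq(Row(1, "a"), Row(2, "b")))
  }
}
```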

I solved the issue by using this as a dependency: https://github.com/MrPowers/spark-fast-tests .
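Usage roughly looks like the following sketch; the trait and method names come from the spark-fast-tests README, so check them against the version you pull in:

```scala
import org.apache.spark.sql.SparkSession
import com.github.mrpowers.spark.fast.tests.DataFrameComparer
import org.scalatest.funsuite.AnyFunSuite

class GeneratedDataFrameSpec extends AnyFunSuite with DataFrameComparer {

  lazy val spark: SparkSession = SparkSession.builder()
    .master("local")
    .appName("spark-fast-tests example")
    .getOrCreate()

  import spark.implicits._

  test("generated and hand-built dataframes are equal") {
    val da = Seq((1, "a"), (2, "b")).toDF("id", "value")
    val ds = Seq((1, "a"), (2, "b")).toDF("id", "value")
    // Collects both sides to the driver and compares schema and rows,
    // so it is only appropriate for small test dataframes.
    assertSmallDataFrameEquality(da, ds)
  }
}
```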

Another solution would be to iterate over the rows of the dataframes individually and compare them.
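A rough sketch of that approach, assuming both dataframes are small enough to collect to the driver (the helper name is mine); comparing multisets of rows makes the check order-independent and, unlike the except-based check, sensitive to duplicate counts:

```scala
import org.apache.spark.sql.{DataFrame, Row}

// Compares two small dataframes by schema and by the multiset of
// their rows. Row has value-based equals, so rows can be used as
// map keys directly.
def sameContents(da: DataFrame, ds: DataFrame): Boolean = {
  def rowCounts(df: DataFrame): Map[Row, Int] =
    df.collect().groupBy(identity).map { case (row, rs) => row -> rs.length }
  da.schema == ds.schema && rowCounts(da) == rowCounts(ds)
}
```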

