
How should I write unit tests in Spark, for a basic data frame creation example?

I'm struggling to write a basic unit test for creation of a data frame, using the example text file provided with Spark, as follows.

class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {

private val master = "local[*]"
private val appName = "data_load_testing"

private var spark: SparkSession = _

override def beforeEach() {
  spark = new SparkSession.Builder().appName(appName).getOrCreate()
}

import spark.implicits._

 case class Person(name: String, age: Int)

  val df = spark.sparkContext
      .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0),attributes(1).trim.toInt))
      .toDF()

  test("Creating dataframe should produce data from of correct size") {
  assert(df.count() == 3)
  assert(df.take(1).equals(Array("Michael",29)))
}

override def afterEach(): Unit = {
  spark.stop()
}

}

I know that the code itself works (from spark.implicits._ through .toDF()) because I have verified this in the Spark-Scala shell, but inside the test class I'm getting lots of errors; the IDE does not recognise import spark.implicits._ or toDF(), and therefore the tests don't run.

I am using SparkSession which automatically creates SparkConf, SparkContext and SQLContext under the hood.

My code simply uses the example code from the Spark repo.

Any ideas why this is not working? Thanks!

NB. I have already looked at the Spark unit test questions on StackOverflow, like this one: How to write unit tests in Spark 2.0+? I have used this to write the test but I'm still getting the errors.

I'm using Scala 2.11.8 and Spark 2.2.0, with SBT and IntelliJ. These dependencies are correctly included within the SBT build file. The errors on running the tests are:

Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person] possible cause: maybe a semicolon is missing before `value toDF'? .toDF()

Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found. import spark.implicits._

IntelliJ won't recognise import spark.implicits._ or the .toDF() method.

I have imported:

    import org.apache.spark.sql.SparkSession
    import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}
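
For reference, a build.sbt for a Scala 2.11.8 / Spark 2.2.0 project like this would typically contain entries along the following lines (the ScalaTest version shown is illustrative; only the Spark and Scala versions are stated above):

    // build.sbt (sketch; only the Spark and Scala versions come from the question above)
    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.2.0",
      "org.apache.spark" %% "spark-sql"  % "2.2.0",
      "org.scalatest"    %% "scalatest"  % "3.0.4" % Test
    )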

You need to assign the sqlContext to a val for the implicits to work. Since your sparkSession is a var, implicits won't work with it.

So you need to do:

val sQLContext = spark.sqlContext
import sQLContext.implicits._

Moreover, you can write the setup inside your test functions, so that your test class looks like the following. Note that the Person case class is defined outside the test class: case classes used with toDF() should be top-level, otherwise Spark may not be able to derive an encoder for them.

class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {

  private val master = "local[*]"
  private val appName = "data_load_testing"

  var spark: SparkSession = _

  override def beforeEach() {
    spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
  }


  test("Creating dataframe should produce data from of correct size") {
    val sQLContext = spark.sqlContext
    import sQLContext.implicits._

    val df = spark.sparkContext
      .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()

    assert(df.count() == 3)
    assert(df.take(1)(0)(0).equals("Michael"))
  }

  override def afterEach() {
    spark.stop()
  }

}
case class Person(name: String, age: Int)

There are many libraries for unit testing Spark; one of the most widely used is

spark-testing-base, by Holden Karau.
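
To use it, the spark-testing-base artifact has to be on the test classpath. A rough sketch of the sbt coordinate follows; the exact version string is an assumption, since it pairs a Spark version with a library release, so check the project's README for the one matching your Spark version:

    // build.sbt (version string is an assumption; pick the release matching your Spark version)
    libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test"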

This library provides sc as the SparkContext out of the box; below is a simple example.

class TestSharedSparkContext extends FunSuite with SharedSparkContext {

  val expectedResult = List(("a", 3),("b", 2),("c", 4))

  test("Word counts should be equal to expected") {
    verifyWordCount(Seq("c a a b a c b c c"))
  }

  def verifyWordCount(seq: Seq[String]): Unit = {
    assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
  }
}
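
The WordCount class itself is not part of the snippet above; a minimal sketch of what it might look like (the class name and the transform signature are taken from the test, everything else is an assumption):

    import org.apache.spark.rdd.RDD

    // Hypothetical WordCount used by the test above: splits each line on whitespace,
    // counts occurrences of each word, and sorts by word so the collected result is deterministic.
    class WordCount extends Serializable {
      def transform(lines: RDD[String]): RDD[(String, Int)] =
        lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .sortBy(_._1)
    }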

Here, everything is prepared for you, with sc available as the SparkContext.

Another approach is to create a TestSparkWrapper trait and reuse it for multiple test cases, as below.

import org.apache.spark.sql.SparkSession

trait TestSparkWrapper {

  lazy val sparkSession: SparkSession = 
    SparkSession.builder().master("local").appName("spark test example ").getOrCreate()

}

Then use this TestSparkWrapper for all the tests with ScalaTest, combining it with BeforeAndAfterAll and BeforeAndAfterEach as needed.
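
For example, a suite using the wrapper might look roughly like this (the test content is illustrative; only the TestSparkWrapper trait comes from the snippet above):

    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    class DataFrameSpec extends FunSuite with BeforeAndAfterAll with TestSparkWrapper {

      test("a DataFrame built from a Seq has the expected size") {
        import sparkSession.implicits._
        val df = Seq(("Michael", 29), ("Andy", 30), ("Justin", 19)).toDF("name", "age")
        assert(df.count() == 3)
      }

      // Stop the shared session once, after all tests in this suite have run.
      override def afterAll(): Unit = {
        sparkSession.stop()
      }
    }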

Hope this helps!
