
Creation of test spark delta table very slow

I am attempting to write some test cases for our Spark logic by creating tiny input Delta tables with known values. However, I am noticing that the creation of a single-row Delta table takes a very long time, about 6 seconds per table. This quickly adds up, and some test cases that use multiple tables take minutes to run!

I accept that Spark tests will be on the slow side, but similar tests with Parquet have creation times of around 400 ms, which would be tolerable.

I am running these tests on Windows, which could be contributing to my issues, but other formats run fine and are orders of magnitude faster.

The test case I'm using to generate my timings is:

  "delta" should "create in a reasonable time" in {

    val spark: SparkSession = SparkSession.builder
      .master("local[1]")
      .getOrCreate()

    import spark.implicits._

    // This takes ~15 seconds, but most of that can be attributed to Spark warming up
    val preloadStart = System.currentTimeMillis()
    Seq(("test-1", "my-test"))
      .toDF("Id", "Source")
      .write
      .format("delta")
      .save(s"c:/tmp/test-${java.util.UUID.randomUUID()}")
    val preloadEnd = System.currentTimeMillis()
    println("Preload Elapsed time: " + (preloadEnd - preloadStart) + "ms")

    // actual test: why does this take ~6 seconds?!
    val testStart = System.currentTimeMillis()
    Seq(("test-2", "my-test"))
      .toDF("Id", "Source")
      .write
      .format("delta")
      .save(s"c:/tmp/test-${java.util.UUID.randomUUID()}")
    val testEnd = System.currentTimeMillis()
    println("Test Elapsed time: " + (testEnd - testStart) + "ms")
  }

Is there a configuration value I am missing, or some other way to speed up the Delta table creation?

Spark's default configurations are not designed for the small jobs that are typical of unit tests. Here are the configurations Delta Lake uses in its own unit tests:

javaOptions in Test ++= Seq(
  // Disable the Spark UI so no web server is started for each test session
  "-Dspark.ui.enabled=false",
  // Suppress the console progress bar in test output
  "-Dspark.ui.showConsoleProgress=false",
  // Compute the Delta log snapshot with far fewer partitions than the default
  "-Dspark.databricks.delta.snapshotPartitions=2",
  // Far fewer shuffle partitions than the default of 200
  "-Dspark.sql.shuffle.partitions=5",
  // Cache only a few DeltaLog instances (a JVM system property, not a Spark conf)
  "-Ddelta.log.cacheSize=3",
  // Lower the parallelism used for parallel file listing
  "-Dspark.sql.sources.parallelPartitionDiscovery.parallelism=5",
  "-Xmx1024m"
)

You can also apply the same set of configurations to speed up your tests.
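If your tests build their own SparkSession rather than passing JVM options through the build tool, the Spark-level settings above can also be set directly on the builder. Below is a minimal sketch of that approach, using the same keys and values as the answer; note that delta.log.cacheSize and -Xmx are JVM-level settings, so they still have to be passed as JVM options rather than as Spark configs.

import org.apache.spark.sql.SparkSession

// Sketch: the same test-friendly settings applied on the SparkSession builder.
// Values are copied from the javaOptions above; adjust for your Spark/Delta versions.
val spark: SparkSession = SparkSession.builder
  .master("local[1]")
  .config("spark.ui.enabled", "false")
  .config("spark.ui.showConsoleProgress", "false")
  .config("spark.databricks.delta.snapshotPartitions", "2")
  .config("spark.sql.shuffle.partitions", "5")
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "5")
  .getOrCreate()

For single-row tables, spark.databricks.delta.snapshotPartitions and spark.sql.shuffle.partitions usually make the biggest difference, since the defaults launch far more tasks than there is data to process.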
