
Disable Spark Catalyst optimizer

To give some background, I am trying to run the TPC-DS benchmark on Spark with and without Spark's Catalyst optimizer. For complicated queries on smaller datasets, we might spend more time optimizing the plans than actually executing them, so I wanted to measure the performance impact of the optimizer on the overall execution of the query.

Is there a way to disable some or all of the spark catalyst optimization rules?

This ability was added in Spark 2.4.0 as part of SPARK-24802.

val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
    .doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
      "specified by their rule names and separated by comma. It is not guaranteed that all the " +
      "rules in this configuration will eventually be excluded, as some rules are necessary " +
      "for correctness. The optimizer will log the rules that have indeed been excluded.")
    .stringConf
    .createOptional
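
For example, a minimal sketch of setting it programmatically (the rule names below are only illustrative; rules required for correctness stay enabled regardless, and the optimizer logs which rules were actually excluded):

spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.LikeSimplification," +
  "org.apache.spark.sql.catalyst.optimizer.ConstantFolding")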

You can find the list of optimizer rules here.
Ideally, though, we shouldn't be disabling these rules, since most of them provide performance benefits. We should identify the rule that consumes the time, check whether it is actually useful for the query, and only then disable it.
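
To see which rules are expensive, here is a hedged sketch using the per-rule timing that Spark's RuleExecutor collects (assuming these helpers are available in your Spark version):

import org.apache.spark.sql.catalyst.rules.RuleExecutor

// Clear the cumulative counters, force a query to be planned, then dump the
// per-rule time report to see which optimizer rules dominate planning time.
RuleExecutor.resetMetrics()
spark.sql("SELECT * FROM t WHERE v LIKE '%bc'").explain()
println(RuleExecutor.dumpTimeSpent())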

I know this isn't an exact answer, but it may help you.

This assumes your driver is not multithreaded (a hint for optimization if Catalyst is slow? :) ).

If you want to measure the time spent in Catalyst, go to the Spark UI and check how long your executors are idle, or check the list of stages/jobs.

If you have a job that starts at 15:30 and runs for 30 seconds, and the next one starts at 15:32, Catalyst is probably spending the remaining 1 minute 30 seconds optimizing (assuming no driver-heavy work is done).

Or, even better, log just before calling every action in Spark, and then check how much time passes until the first task is actually sent to an executor.
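
A rough sketch of that idea, using a SparkListener to timestamp the first task launch (the DataFrame name df and the log format are just placeholders):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

// Log when the first task is launched; the gap between the log line printed
// just before the action and this timestamp approximates driver-side
// (mostly Catalyst) time for that job.
spark.sparkContext.addSparkListener(new SparkListener {
  @volatile private var firstTaskSeen = false
  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
    if (!firstTaskSeen) {
      firstTaskSeen = true
      println(s"first task started at ${System.currentTimeMillis()} ms")
    }
  }
})

println(s"calling action at ${System.currentTimeMillis()} ms")
df.count()  // df is a placeholder for your DataFrame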

Just to complete the picture: I asked on the pull request for SPARK-24802 how to do this, and Takeshi Yamamuro kindly answered:

scala> Seq("abc", "def").toDF("v").write.saveAsTable("t")
scala> sql("SELECT * FROM t WHERE v LIKE '%bc'").explain()
== Physical Plan ==
*(1) Project [v#18]
+- *(1) Filter (isnotnull(v#18) AND EndsWith(v#18, bc))
                                    ^^^^^^^^
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t[v#18] ...

scala> sql("SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.LikeSimplification")

scala> sql("SELECT * FROM t WHERE v LIKE '%bc'").explain()
== Physical Plan ==
*(1) Project [v#18]
+- *(1) Filter (isnotnull(v#18) AND v#18 LIKE %bc)
                                         ^^^^
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t[v#18] ...
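
To re-enable the rule afterwards in the same session, the setting can simply be cleared again, e.g.:

scala> spark.conf.unset("spark.sql.optimizer.excludedRules")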

I hope this helps.

You can also turn off the cost-based optimizer (CBO) by setting

sparkSession.conf.set("spark.sql.cbo.enabled", false)

or

--conf spark.sql.cbo.enabled=false

when launching spark-submit. Note that CBO is only the cost-based part of the optimizer; the rule-based Catalyst rules still run.
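
For completeness, a minimal sketch of applying the same setting at session creation (the app name is just illustrative):

import org.apache.spark.sql.SparkSession

// Build a session with CBO disabled from the start; equivalent to setting
// the conf on an existing session or passing --conf at submit time.
val spark = SparkSession.builder()
  .appName("tpcds-without-cbo")
  .config("spark.sql.cbo.enabled", "false")
  .getOrCreate()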
