
Spark-SQL CLI: SupportsPushDownFilters.pushFilters not called

I'm trying to implement a DataSourceV2 with SupportsPushDownFilters, testing it on Spark 2.3.1 with both the spark-sql CLI and spark-shell.

Issue: SupportsPushDownFilters.pushFilters is not called when the query is run from the spark-sql CLI (my breakpoints are not hit), but it is called when using the DataFrame API directly.

My code:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.sources.{DataSourceRegister, Filter, RelationProvider}
import org.apache.spark.sql.sources.v2.{DataSourceOptions, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownFilters}

class DefaultSource extends ReadSupport
  with DataSourceRegister
  with RelationProvider {

  // shortName() and createRelation() are omitted here for brevity
  // (the full DefaultSource is shown further down)

  def createReader(options: DataSourceOptions): DataSourceReader = {
    val path = options.get("path").get
    val sc = SparkSession.builder.getOrCreate().sparkContext
    val conf = sc.hadoopConfiguration
    new MyDataSourceReader(path, conf)
  }
}

class MyDataSourceReader(path: String, conf: Configuration)
  extends DataSourceReader
  with SupportsPushDownFilters {

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    println(filters.toList)
    filters // nothing is actually pushed; Spark re-applies these filters itself
  }

  // required by SupportsPushDownFilters
  override def pushedFilters(): Array[Filter] = Array.empty

  // readSchema() and createDataReaderFactories() are omitted for brevity
}

Filters are pushed when using the DataFrame API directly or the spark.sql API (note the filters printed in the console output):

scala> val df=spark.read.format("com.my.spark.datasource.csv2").load("test.csv2")
scala> df.filter("age>24").show
List(IsNotNull(age), GreaterThan(age,24))
+----+---+----------+
|name|age|      addr|
+----+---+----------+
| Ann| 25|one st. 12|
|Mary| 27|one st. 14|
+----+---+----------+

scala> df.createOrReplaceTempView("v1")
scala> spark.sql("select * from v1 where age>24").show
List(IsNotNull(age), GreaterThan(age,24))
+----+---+----------+
|name|age|      addr|
+----+---+----------+
| Ann| 25|one st. 12|
|Mary| 27|one st. 14|
+----+---+----------+

Filters are NOT pushed down when the same query is run from the SQL CLI. (There is nothing in the CLI output to confirm this; I'm only showing how the queries are executed. My breakpoints are not hit when debugging my data source.)

E:\git\spark-2.3.0>bin\spark-sql
Listening for transport dt_socket at address: 5005
spark-sql> CREATE TEMPORARY VIEW v1 USING com.my.spark.datasource.csv2 OPTIONS (path "test.csv2");
Time taken: 2.188 seconds
18/08/16 09:46:52 INFO SparkSQLCLIDriver: Time taken: 2.188 seconds

spark-sql> select * from v1 where age>24;
18/08/16 09:47:22 INFO DAGScheduler: Job 0 finished: processCmd at CliDriver.java:376, took 12.326064 s
Ann     25      one st. 12
Mary    27      one st. 14
Time taken: 13.862 seconds, Fetched 2 row(s)
18/08/16 09:47:22 INFO SparkSQLCLIDriver: Time taken: 13.862 seconds, Fetched 2 row(s)

spark-sql> select * from (select * from v1 where age>24) t1;
Ann     25      one st. 12
Mary    27      one st. 14
Time taken: 0.146 seconds, Fetched 2 row(s)
18/08/16 09:47:37 INFO SparkSQLCLIDriver: Time taken: 0.146 seconds, Fetched 2  row(s)

Debugging the Spark engine, as far as I can tell the problem is that the spark-sql CLI path does not produce a DataSourceV2Relation in the plan, whereas DataFrameReader.load does create one.
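
To see the difference, one option is to print the plans on both paths. A small sketch, assuming the same com.my.spark.datasource.csv2 source and test.csv2 file as above:

// In spark-shell: the DataFrame path should show a DataSourceV2Relation node
val df = spark.read.format("com.my.spark.datasource.csv2").load("test.csv2")
println(df.filter("age > 24").queryExecution.optimizedPlan)

// In the spark-sql CLI, EXPLAIN EXTENDED prints the plan built for the temporary view
// spark-sql> EXPLAIN EXTENDED SELECT * FROM v1 WHERE age > 24;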

Am I missing something? Do I need to do anything extra to get filter pushdown in spark-sql? Or is this a known issue?

Browsing the Spark code for usages of Filter, I found PrunedFilteredScan. It turns out this is what needs to be implemented to receive the filters (in buildScan) when the query comes from the SQL CLI:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.execution.datasources.v2.DataSourceRDD
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, Filter, PrunedFilteredScan, RelationProvider}
import org.apache.spark.sql.sources.v2.{DataSourceOptions, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.DataSourceReader
import org.apache.spark.sql.types.StructType

class DefaultSource extends ReadSupport
  with DataSourceRegister
  with RelationProvider {

  def createReader(options: DataSourceOptions): DataSourceReader = {
    val path = options.get("path").get
    val sc = SparkSession.builder.getOrCreate().sparkContext
    val conf = sc.hadoopConfiguration
    new MyDataSourceReader(path, conf)
  }

  override def shortName(): String = "csv2"

  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new Csv2Relation(sqlContext, parameters("path"))
}

class Csv2Relation(context: SQLContext, path: String)
  extends BaseRelation
  with PrunedFilteredScan {

  private val _path = path
  private def conf: Configuration = context.sparkContext.hadoopConfiguration

  // BaseRelation requires sqlContext and schema; the schema is taken from the
  // reader's readSchema() (implementation not shown)
  override def sqlContext: SQLContext = context
  override def schema: StructType = new MyDataSourceReader(_path, conf).readSchema()

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // *** filters are received here when the query comes from the SQL CLI
    val sc = SparkSession.builder.getOrCreate().sparkContext
    new DataSourceRDD(sc, new MyDataSourceReader(_path, conf).createDataReaderFactories())
  }
}

So two interfaces need to be implemented to get filter pushdown working in all scenarios: PrunedFilteredScan plus (SupportsPushDownFilters or SupportsPushDownCatalystFilters).
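
One more note: whichever interface delivers the filters, the source still has to evaluate them itself to benefit from pushdown (Spark re-applies anything reported back as unhandled). Below is a minimal, hypothetical sketch of turning a few common Filter types into a Row predicate that a buildScan implementation could apply; the helper name toPredicate and the handled cases are illustrative only:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan, IsNotNull}

// Hypothetical helper: convert a few pushed filters into a Row predicate.
// Filter types not handled here are ignored (returning true), which is safe
// as long as Spark still re-applies them on top of the scan.
def toPredicate(filters: Array[Filter]): Row => Boolean = { row =>
  filters.forall {
    case IsNotNull(attr)               => !row.isNullAt(row.fieldIndex(attr))
    case GreaterThan(attr, value: Int) => row.getAs[Int](attr) > value
    case EqualTo(attr, value)          => row.getAs[Any](attr) == value
    case _                             => true // not handled here
  }
}

// e.g. inside buildScan: rdd.filter(toPredicate(filters))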
