I am trying to implement DataSourceV2 with SupportsPushDownFilters, testing against Spark 2.3.1 from both the spark-sql CLI and spark-shell.
Issue: SupportsPushDownFilters.pushFilters is not called when the query runs through spark-sql (my breakpoints are never hit), but it is called when I use the DataFrame API directly.
My code:
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.sources.{DataSourceRegister, Filter, RelationProvider}
import org.apache.spark.sql.sources.v2.{DataSourceOptions, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownFilters}

class DefaultSource extends ReadSupport
    with DataSourceRegister
    with RelationProvider {

  // shortName and createRelation elided here
  def createReader(options: DataSourceOptions) = {
    val path = options.get("path").get
    val sc = SparkSession.builder.getOrCreate().sparkContext
    val conf = sc.hadoopConfiguration
    new MyDataSourceReader(path, conf)
  }
}

class MyDataSourceReader(path: String, conf: Configuration)
    extends DataSourceReader
    with SupportsPushDownFilters {

  // readSchema, createDataReaderFactories and pushedFilters elided
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    println(filters.toList)
    filters
  }
}
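As an aside on the contract being implemented: pushFilters is expected to return the *residual* filters that the source cannot evaluate itself, so Spark re-applies them after the scan (returning the full array, as above, is always safe, just slower). A minimal self-contained sketch of that split, using a toy filter ADT standing in for Spark's real Filter classes (all names here are illustrative, not Spark's):

```scala
// Toy stand-ins for org.apache.spark.sql.sources.Filter subclasses.
sealed trait Filter
case class IsNotNull(col: String) extends Filter
case class GreaterThan(col: String, value: Any) extends Filter
case class StringContains(col: String, value: String) extends Filter

// Mirrors the SupportsPushDownFilters contract: keep what the source can
// handle, hand everything else back to Spark as residual filters.
class FilterSplitter(supportedColumns: Set[String]) {
  private var pushed: Array[Filter] = Array.empty

  def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, residual) = filters.partition {
      case IsNotNull(c)      => supportedColumns(c)
      case GreaterThan(c, _) => supportedColumns(c)
      case _                 => false // e.g. StringContains: let Spark evaluate it
    }
    pushed = supported
    residual // Spark re-applies only these after the scan
  }

  // Mirrors SupportsPushDownFilters.pushedFilters, used for plan display.
  def pushedFilters: Array[Filter] = pushed
}
```

For the query above, a splitter built with `Set("age")` would push both `IsNotNull(age)` and `GreaterThan(age,24)` and return no residual.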
Filters are pushed when using the DataFrame API directly or via spark.sql (note the filters printed in the console output):
scala> val df=spark.read.format("com.my.spark.datasource.csv2").load("test.csv2")
scala> df.filter("age>24").show
List(IsNotNull(age), GreaterThan(age,24))
+----+---+----------+
|name|age| addr|
+----+---+----------+
| Ann| 25|one st. 12|
|Mary| 27|one st. 14|
+----+---+----------+
scala> df.createOrReplaceTempView("v1")
scala> spark.sql("select * from v1 where age>24").show
List(IsNotNull(age), GreaterThan(age,24))
+----+---+----------+
|name|age| addr|
+----+---+----------+
| Ann| 25|one st. 12|
|Mary| 27|one st. 14|
+----+---+----------+
Filters are NOT pushed down when the same query runs from the SQL CLI. (There is nothing in the CLI output itself to confirm this; I am just showing how the queries are executed. My breakpoints are not hit when debugging the data source.)
E:\git\spark-2.3.0>bin\spark-sql
Listening for transport dt_socket at address: 5005
spark-sql> CREATE TEMPORARY VIEW v1 USING com.my.spark.datasource.csv2 OPTIONS
(path "test.csv2");
Time taken: 2.188 seconds
18/08/16 09:46:52 INFO SparkSQLCLIDriver: Time taken: 2.188 seconds
spark-sql> select * from v1 where age>24;
18/08/16 09:47:22 INFO DAGScheduler: Job 0 finished: processCmd at
CliDriver.java:376, took 12.326064 s
Ann 25 one st. 12
Mary 27 one st. 14
Time taken: 13.862 seconds, Fetched 2 row(s)
18/08/16 09:47:22 INFO SparkSQLCLIDriver: Time taken: 13.862 seconds, Fetched 2 row(s)
spark-sql> select * from (select * from v1 where age>24) t1;
Ann 25 one st. 12
Mary 27 one st. 14
Time taken: 0.146 seconds, Fetched 2 row(s)
18/08/16 09:47:37 INFO SparkSQLCLIDriver: Time taken: 0.146 seconds, Fetched 2 row(s)
Debugging the Spark engine, as far as I can tell the issue is that the spark-sql CLI path never produces a DataSourceV2Relation in the logical plan, whereas DataFrameReader.load does create one.
Am I missing something? Do I need to do anything extra to get filter pushdown in spark-sql, or is this a known issue?
Browsing the Spark code for usages of Filter, I found PrunedFilteredScan. It turns out this interface must be implemented to receive the filters (in buildScan) when the query comes from the SQL CLI:
import org.apache.hadoop.conf.Configuration
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.execution.datasources.v2.DataSourceRDD
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, Filter, PrunedFilteredScan, RelationProvider}
import org.apache.spark.sql.sources.v2.{DataSourceOptions, ReadSupport}

class DefaultSource extends ReadSupport
    with DataSourceRegister
    with RelationProvider {

  def createReader(options: DataSourceOptions) = {
    val path = options.get("path").get
    val sc = SparkSession.builder.getOrCreate().sparkContext
    val conf = sc.hadoopConfiguration
    new MyDataSourceReader(path, conf)
  }

  override def shortName(): String = "csv2"

  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new Csv2Relation(sqlContext, parameters("path"))
}

class Csv2Relation(context: SQLContext, path: String)
    extends BaseRelation
    with PrunedFilteredScan {

  override def sqlContext: SQLContext = context
  // schema override elided

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // *** filters arrive here when the query runs in the SQL CLI
    val sc = SparkSession.builder.getOrCreate().sparkContext
    val conf = sc.hadoopConfiguration
    new DataSourceRDD(sc, new MyDataSourceReader(path, conf).createDataReaderFactories())
  }
}
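Note that correctness does not depend on the source honoring the filters passed to buildScan: in the V1 contract they are an optimization hint, and Spark re-applies every filter reported by unhandledFilters (by default, all of them). A hedged, self-contained sketch of that behavior, with toy types standing in for Spark's (not Spark's actual planner code):

```scala
// Toy stand-ins for Spark's Row and Filter types.
case class Record(name: String, age: Int)

sealed trait RowFilter { def eval(r: Record): Boolean }
case class Gt(col: String, v: Int) extends RowFilter {
  def eval(r: Record): Boolean = (if (col == "age") r.age else 0) > v
}

trait PrunedFilteredScanLike {
  def buildScan(filters: Array[RowFilter]): Seq[Record]
  // Default mirrors BaseRelation.unhandledFilters: claim to handle nothing,
  // so Spark rechecks every filter after the scan.
  def unhandledFilters(filters: Array[RowFilter]): Array[RowFilter] = filters
}

object Engine {
  // Roughly what the planner does around a V1 scan: run the scan, then
  // re-evaluate the filters the relation did not claim to handle.
  def execute(rel: PrunedFilteredScanLike, filters: Array[RowFilter]): Seq[Record] = {
    val recheck = rel.unhandledFilters(filters)
    rel.buildScan(filters).filter(r => recheck.forall(_.eval(r)))
  }
}
```

This is why a relation like Csv2Relation above can safely ignore the filters at first and add real pushdown incrementally.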
So two interfaces need to be implemented to get pushdown working in all scenarios: PrunedFilteredScan plus (SupportsPushDownFilters or SupportsPushDownCatalystFilters).
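The underlying cause, as diagnosed while debugging, can be sketched as a dispatch difference between the two entry points. This is a simplified, from-memory illustration, not Spark's actual resolution code, and all names here are illustrative:

```scala
// Toy stand-ins for the two source interfaces.
trait DataSourceV2Like
trait RelationProviderLike

sealed trait ResolvedRelation
case object V2Relation extends ResolvedRelation // DataSourceV2Relation: pushFilters runs
case object V1Relation extends ResolvedRelation // BaseRelation: buildScan gets the filters

object Resolve {
  // DataFrameReader.load in 2.3 prefers the V2 path when the class supports it.
  def viaDataFrameReader(source: AnyRef): ResolvedRelation = source match {
    case _: DataSourceV2Like     => V2Relation
    case _: RelationProviderLike => V1Relation
  }

  // CREATE TEMPORARY VIEW ... USING resolves only through the V1 path in 2.3,
  // so a DataSourceV2-only source never reaches pushFilters from the CLI.
  def viaSqlUsing(source: AnyRef): ResolvedRelation = source match {
    case _: RelationProviderLike => V1Relation
  }
}
```

A source class implementing both interfaces, as in the code above, resolves to V2Relation from DataFrameReader.load but to V1Relation from the SQL CLI, which matches the observed behavior.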