
Apache Spark: using plain SQL queries vs using Spark SQL methods

I'm very new to Apache Spark. I have a basic question: which of the two approaches below performs better: passing a plain SQL query, or using Spark SQL methods such as select and filter? Here's a short example in Java that illustrates my question.

    import org.apache.spark.SparkConf;
    import org.apache.spark.sql.AnalysisException;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col; // needed for col(...) below

    private static void queryVsSparkSQL() throws AnalysisException {
        SparkConf conf = new SparkConf();

        SparkSession spark = SparkSession
                .builder()
                .master("local[4]")
                .config(conf)
                .appName("queryVsSparkSQL")
                .getOrCreate();

        //using predefined query
        Dataset<Row> ds1 = spark
                .read()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl")
                .option("user", "hr")
                .option("password", "hr")
                .option("query","select * from hr.employees t where t.last_name = 'King'")
                .load();
        ds1.show();

        //using spark sql methods: select, filter
        Dataset<Row> ds2 = spark
                .read()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl")
                .option("user", "hr")
                .option("password", "hr")
                .option("dbtable", "hr.employees")
                .load()
                .select("*")
                .filter(col("last_name").equalTo("King"));

        ds2.show();
    }

Try .explain() and check whether a pushdown predicate is used for your second query.

It should be there in the second case. If so, the two approaches are technically equivalent in performance: the filter is applied in the database either way, whether pushed down by Spark or written explicitly into the query option.
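For example, with the two datasets from your own code (a minimal sketch; explain(true) would additionally print the logical plans):

    // Print the physical plan for each variant; in the ds2 plan, look for
    // a non-empty PushedFilters entry -- that is the pushdown predicate.
    ds1.explain();
    ds2.explain();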

Below is a simulated version against MySQL, based on your approach.

CASE 1: Passing a query option whose select statement contains the filter

    val dataframe_mysql = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql-rfam-public.ebi.ac.uk:4497/Rfam")
      .option("driver", "org.mariadb.jdbc.Driver")
      .option("query", "select * from family where rfam_acc = 'RF01527'")
      .option("user", "rfamro")
      .load()

    dataframe_mysql.explain()

    == Physical Plan ==
    *(1) Scan JDBCRelation((select * from family where rfam_acc = 'RF01527') SPARK_GEN_SUBQ_4) [numPartitions=1] [rfam_acc#867,rfam_id#868,auto_wiki#869L,description#870,author#871,seed_source#872,gathering_cutoff#873,trusted_cutoff#874,noise_cutoff#875,comment#876,previous_id#877,cmbuild#878,cmcalibrate#879,cmsearch#880,num_seed#881L,num_full#882L,num_genome_seq#883L,num_refseq#884L,type#885,structure_source#886,number_of_species#887L,number_3d_structures#888,num_pseudonokts#889,tax_seed#890,... 11 more fields] PushedFilters: [], ReadSchema: struct<rfam_acc:string,rfam_id:string,auto_wiki:bigint,description:string,author:string,seed_sour...

Here PushedFilters is empty because the query option was used: the filter is already contained in the query string passed to the database.

CASE 2: No query option; instead using the Spark SQL API with a filter

    val dataframe_mysql = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql-rfam-public.ebi.ac.uk:4497/Rfam")
      .option("driver", "org.mariadb.jdbc.Driver")
      .option("dbtable", "family")
      .option("user", "rfamro")
      .load()
      .select("*")
      .filter(col("rfam_acc").equalTo("RF01527"))

    dataframe_mysql.explain()

    == Physical Plan ==
    *(1) Scan JDBCRelation(family) [numPartitions=1] [rfam_acc#1149,rfam_id#1150,auto_wiki#1151L,description#1152,author#1153,seed_source#1154,gathering_cutoff#1155,trusted_cutoff#1156,noise_cutoff#1157,comment#1158,previous_id#1159,cmbuild#1160,cmcalibrate#1161,cmsearch#1162,num_seed#1163L,num_full#1164L,num_genome_seq#1165L,num_refseq#1166L,type#1167,structure_source#1168,number_of_species#1169L,number_3d_structures#1170,num_pseudonokts#1171,tax_seed#1172,... 11 more fields] PushedFilters: [*IsNotNull(rfam_acc), *EqualTo(rfam_acc,RF01527)], ReadSchema: struct<rfam_acc:string,rfam_id:string,auto_wiki:bigint,description:string,author:string,seed_sour...

PushedFilters is set to the filter criteria, so the filtering is applied in the database itself before data is returned to Spark. Note the * on the PushedFilters entries; it signifies that the filter is evaluated at the data source, i.e. the database.
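If you prefer to check this programmatically rather than by eyeballing the output, one possible sketch, assuming the plan-string format shown above (queryExecution is a developer-facing API, so the exact text can vary across Spark versions):

    // Render the executed physical plan as a string and look for a
    // pushed-down filter marker such as "PushedFilters: [*...".
    String plan = ds2.queryExecution().executedPlan().toString();
    boolean pushedDown = plan.contains("PushedFilters: [*");
    System.out.println("Filter pushed down to the database: " + pushedDown);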

Summary

I ran both options and both completed quickly, with comparable timings. They are equivalent in terms of the processing done by the database: only the filtered rows are returned to Spark, via two different mechanisms that produce the same physical behaviour, performance, and results.
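If you want to verify the timing on your own data, a crude micro-benchmark along these lines can help (a sketch only: JVM warm-up and connection setup dominate on small tables, so treat the numbers as rough):

    // Force a full read of each variant and compare wall-clock times.
    long t0 = System.nanoTime();
    ds1.count();
    long t1 = System.nanoTime();
    ds2.count();
    long t2 = System.nanoTime();
    System.out.println("query option:   " + (t1 - t0) / 1_000_000 + " ms");
    System.out.println("API + pushdown: " + (t2 - t1) / 1_000_000 + " ms");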
