
Spark SQL queries vs DataFrame functions

To get good performance with Spark, I'm wondering whether it is better to run SQL queries via SQLContext, or to query via DataFrame functions like df.select().

Any idea? :)

There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.

  • SQL queries are also portable and can be used without modification in every supported language. With HiveContext, they can also expose functionality that is inaccessible in other ways (for example, UDFs without Spark wrappers).

Ideally, Spark's Catalyst optimizer should compile both calls to the same execution plan, so performance should be identical; which one you call is just a matter of style. In practice, there is a difference according to a report by Hortonworks ( https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html ), where SQL outperformed DataFrames for a case that needed grouped records with their total counts, sorted in descending order by record name.

"

By using the DataFrame API, one can break a single SQL query into multiple statements/queries, which helps with debugging, incremental enhancement, and code maintenance.

The only thing that matters is which underlying algorithm is used for grouping. HashAggregation is more efficient than SortAggregation:

  • SortAggregation sorts the rows and then gathers the matching rows together: O(n log n).

  • HashAggregation builds a hash map using the grouping columns as keys and the remaining columns as values: O(n).

Spark SQL uses HashAggregation where possible (i.e., when the aggregated values are of mutable types).
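The two strategies can be sketched in plain Python. This only illustrates the complexity argument; it is not Spark's actual implementation:

```python
from itertools import groupby

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

def hash_aggregation(rows):
    # Build a hash map keyed by the grouping column; O(n) expected time.
    acc = {}
    for key, value in rows:
        acc[key] = acc.get(key, 0) + value
    return acc

def sort_aggregation(rows):
    # Sort by key first (O(n log n)), then fold adjacent matching rows.
    out = {}
    for key, group in groupby(sorted(rows, key=lambda r: r[0]), key=lambda r: r[0]):
        out[key] = sum(v for _, v in group)
    return out

# Same answer either way; the difference is purely the cost of getting there.
assert hash_aggregation(rows) == sort_aggregation(rows)
```

Both produce identical grouped sums; hash aggregation simply avoids the sort, which is where the O(n) versus O(n log n) gap comes from.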
