
Spark SQL queries vs DataFrame functions

To get good performance with Spark, I'm wondering whether it is better to run SQL queries via SQLContext, or to query via DataFrame functions like df.select().

Any idea? :)

There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.

  • SQL queries are also portable and can be used without modification in every supported language. With HiveContext, they can also expose functionality that is inaccessible in other ways (for example, UDFs without Spark wrappers).

Ideally, Spark's Catalyst optimizer should compile both calls to the same execution plan, so performance should be identical; which one you call is just a matter of style. In practice, there is a difference according to a report by Hortonworks ( https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html ), where SQL outperformed DataFrames for a case that needed grouped records with their total counts, sorted in descending order by record name.

"

By using the DataFrame API, one can break a single SQL query into multiple statements/queries, which helps with debugging, incremental enhancement, and code maintenance.

The only thing that matters is which underlying algorithm is used for grouping. HashAggregation is more efficient than SortAggregation:

  • SortAggregation sorts the rows and then gathers the matching rows together: O(n log n).

  • HashAggregation builds a hash map using the grouping columns as keys and the remaining columns as values: O(n).

Spark SQL uses HashAggregation where possible (i.e., when the aggregated values are of mutable types).
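The two strategies can be sketched in plain Python. This only illustrates the complexity argument; it is not Spark's actual implementation:

```python
from itertools import groupby

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

def hash_aggregation(rows):
    # Build a hash map keyed by the grouping column; O(n) expected time.
    acc = {}
    for key, value in rows:
        acc[key] = acc.get(key, 0) + value
    return acc

def sort_aggregation(rows):
    # Sort by key first (O(n log n)), then fold adjacent matching rows.
    out = {}
    for key, group in groupby(sorted(rows, key=lambda r: r[0]), key=lambda r: r[0]):
        out[key] = sum(v for _, v in group)
    return out

# Same answer either way; the difference is purely the cost of getting there.
assert hash_aggregation(rows) == sort_aggregation(rows)
```

Both produce identical grouped sums; hash aggregation simply avoids the sort, which is where the O(n) versus O(n log n) gap comes from.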
