
"Spark sql 查询与数据框函数"

[英]Spark sql queries vs dataframe functions

I want to get good performance with Spark. I'm wondering whether it is better to run SQL queries via SQLContext, or via DataFrame functions like df.select().

Any idea? :)

There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.
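You can see this equivalence yourself by comparing the physical plans that Catalyst produces for the two styles. The sketch below (column name and filter are illustrative) expresses the same query once via SQL and once via DataFrame functions; explain() prints the same plan for both:

```scala
import org.apache.spark.sql.SparkSession

object SqlVsDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._

    val df = spark.range(100).toDF("value")
    df.createOrReplaceTempView("t")

    // The same logical query expressed two ways:
    val viaSql = spark.sql("SELECT value FROM t WHERE value > 50")
    val viaDf  = df.select("value").where($"value" > 50)

    // Both go through the Catalyst optimizer and yield the same physical plan:
    viaSql.explain()
    viaDf.explain()

    spark.stop()
  }
}
```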

  • Arguably, DataFrame queries are much easier to construct programmatically, and they provide minimal type safety.

  • Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any modification in every supported language. With HiveContext, they can also be used to expose some functionality that is inaccessible in other ways (for example, UDFs without Spark wrappers).

Ideally, Spark's Catalyst optimizer should optimize both calls into the same execution plan, so performance should be the same; how you write the query is just a matter of style. In reality, though, there is a difference according to the Hortonworks report ( https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html ), where SQL outperforms DataFrames in a case where you need GROUPed records with their total COUNTs, sorted descending by record name.

"

By using DataFrames, one can break the SQL into multiple statements/queries, which helps with debugging, easy enhancement, and code maintenance.

Breaking complex SQL queries into simpler queries and assigning each result to a DataFrame brings better understanding.

By splitting the query into multiple DataFrames, the developer gains the advantage of using cache and repartitioning (to distribute data evenly across the partitions using a unique/close-to-unique key).
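A hedged sketch of that workflow (the data, column names, and "shipped orders" scenario are all illustrative, standing in for a real table): each step gets its own named DataFrame, so intermediate results can be inspected, cached, and repartitioned independently.

```scala
import org.apache.spark.sql.SparkSession

object SplitQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("split").getOrCreate()
    import spark.implicits._

    // Illustrative data standing in for a real "orders" table.
    val orders = Seq(
      ("c1", "SHIPPED", 10), ("c2", "SHIPPED", 20), ("c1", "PENDING", 5)
    ).toDF("customer_id", "status", "amount")

    // Step 1: filter, as its own named DataFrame (easy to inspect and debug).
    val shipped = orders.where($"status" === "SHIPPED")

    // Step 2: repartition on a close-to-unique key and cache for reuse.
    val byCustomer = shipped.repartition($"customer_id").cache()

    // Step 3: aggregate the cached intermediate result.
    val totals = byCustomer.groupBy("customer_id").sum("amount")
    totals.show()

    spark.stop()
  }
}
```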

"

The only thing that matters is what kind of underlying algorithm is used for grouping. HashAggregation is more efficient than SortAggregation. SortAggregation sorts the rows and then gathers the matching rows together: O(n log n). HashAggregation builds a HashMap using the grouping columns as keys and the remaining columns as values in the map: O(n). Spark SQL uses HashAggregation where possible (i.e., if the data in the aggregation buffer is mutable).
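The two strategies can be sketched outside Spark in plain Scala (a minimal illustration of the complexity difference, not Spark's actual implementation):

```scala
// SortAggregation: sort by key, then gather adjacent matches -> O(n log n).
def sortAggregate(rows: Seq[(String, Int)]): Map[String, Int] =
  rows.sortBy(_._1)                       // the O(n log n) step
      .foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
        acc.updated(k, acc.getOrElse(k, 0) + v)
      }

// HashAggregation: a single pass building a mutable HashMap -> O(n).
def hashAggregate(rows: Seq[(String, Int)]): Map[String, Int] = {
  val m = scala.collection.mutable.HashMap.empty[String, Int]
  for ((k, v) <- rows) m(k) = m.getOrElse(k, 0) + v
  m.toMap
}

val rows = Seq(("a", 1), ("b", 2), ("a", 3))
// Both strategies produce the same result: Map("a" -> 4, "b" -> 2)
```

Both return the same grouped sums; the hash strategy simply avoids the sort, which is why Spark prefers it whenever the aggregation buffer values are mutable.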
