Calculate table statistics using scala and spark-sql

I am using Spark 2.4.0 and Scala 2.11.12 on my company's big data environment. In my projects I create many tables with huge amounts of data. Now I want to calculate statistics on the tables I create.

I found the following Scala/Spark SQL statements that should do it:

// example 1
val res = spark.sql("ANALYZE TABLE mytablename COMPUTE STATISTICS FOR COLUMNS col_name1, col_name2")

// example 2: same statement, but inspecting the logical plan
import org.apache.spark.sql.execution.command.AnalyzeColumnCommand
val res = spark.sql("ANALYZE TABLE mytablename COMPUTE STATISTICS FOR COLUMNS col_name1, col_name2").queryExecution.logical

// example 3
val res = spark.sql("ANALYZE TABLE mytablename COMPUTE STATISTICS FOR ALL COLUMNS")

// example 4
val res = spark.sql("ANALYZE TABLE mytablename COMPUTE STATISTICS FOR COLUMNS col_not_exists")

In all cases I invalidate the metadata first before I start to check the results.
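One way to check whether any statistics were actually persisted, besides the SQL commands below, is to read the catalog entry programmatically. A minimal sketch, assuming the table name mytablename from the question and an active SparkSession named spark; note that sessionState is an internal, @Unstable API in Spark 2.4, so this is only a diagnostic aid:

import org.apache.spark.sql.catalyst.TableIdentifier

// Fetch the catalog metadata for the table; stats is Option[CatalogStatistics].
val meta = spark.sessionState.catalog.getTableMetadata(TableIdentifier("mytablename"))

meta.stats match {
  case Some(s) =>
    // Table-level stats, plus any per-column stats that were computed.
    println(s"sizeInBytes=${s.sizeInBytes}, rowCount=${s.rowCount}")
    s.colStats.foreach { case (col, cs) => println(s"$col -> $cs") }
  case None =>
    println("No statistics stored in the catalog for this table.")
}

If stats comes back as None after running ANALYZE TABLE, the statistics were never written to the metastore, which would explain the empty output described below.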

In case of example // 1 I don't receive any error messages, but I also don't see any results in the table stats ("show table stats mytablename"). It seems like no calculation has been done for those columns. In case of example // 2 I get the same results as for // 1. For example // 3 I receive the error message:

org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'ALL' expecting <EOF>(line 1, pos 70)

== SQL ==
ANALYZE TABLE mytablename COMPUTE STATISTICS FOR ALL COLUMNS
-------------------------------------------------^^^

In case of the last example // 4 I try to calculate table statistics for a column that does not exist in the table. Here, contrary to what I expected, I don't receive any error message.

What is the best practice for simply calculating table statistics with Scala 2.11 and Spark 2.4, for all columns or at least for some of them?

The support for ALL COLUMNS was added in Spark 3.0, as you can see here. Before Spark 3.0 you need to specify the column names for which you want to compute stats. Your example 1 should work, and if you want to see the computed stats you can run (for the column-level stats):

DESCRIBE EXTENDED table_name table_col

or just (for table-level stats):

DESCRIBE EXTENDED table_name

There is a Statistics entry in the col_name column of the output with the relevant information. And if you still don't see it for some reason, this might also help:

refresh table table_name
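Putting the pieces together, a minimal end-to-end sketch for Spark 2.4 / Scala 2.11, assuming an active SparkSession named spark and a table mytablename with columns col_name1 and col_name2 (the names from the question):

// 1. Compute table-level and column-level statistics.
spark.sql("ANALYZE TABLE mytablename COMPUTE STATISTICS FOR COLUMNS col_name1, col_name2")

// 2. Refresh Spark's cached metadata so the freshly written stats are visible.
spark.sql("REFRESH TABLE mytablename")

// 3. Table-level stats appear in the 'Statistics' row of the extended description.
spark.sql("DESCRIBE EXTENDED mytablename")
  .filter("col_name = 'Statistics'")
  .show(truncate = false)

// 4. Column-level stats (min, max, num_nulls, distinct_count, ...) per column.
spark.sql("DESCRIBE EXTENDED mytablename col_name1").show(truncate = false)

The filter on col_name = 'Statistics' just trims the DESCRIBE EXTENDED output down to the one row that carries the stats; dropping the filter shows the full table description.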
