
Calculate table statistics using scala and spark-sql

I am using Spark 2.4.0 and Scala 2.11.12 in my company's big data environment. In my projects I create many tables with huge amounts of data. Now I want to calculate statistics on the tables I create.

I found the following Scala/Spark SQL statements that should do it:

// example 1
val res = spark.sql("ANALYZE TABLE mytablename COMPUTE STATISTICS FOR COLUMNS col_name1, col_name2")

// example 2
val res = spark.sql("ANALYZE TABLE mytablename COMPUTE STATISTICS FOR COLUMNS col_name1, col_name2").queryExecution.logical
import org.apache.spark.sql.execution.command.AnalyzeColumnCommand

// example 3
val res = spark.sql("ANALYZE TABLE mytablename COMPUTE STATISTICS FOR ALL COLUMNS")

// example 4
val res = spark.sql("ANALYZE TABLE mytablename COMPUTE STATISTICS FOR COLUMNS col_not_exists")

In all cases I invalidate the metadata before I check the results.

For example 1 I don't receive any error message, but I also don't see any results in the table stats ("show table stats mytablename"). It seems that no statistics were computed for those columns. Example 2 gives the same result as example 1. For example 3 I receive the following error message:

org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'ALL' expecting <EOF>(line 1, pos 70)

== SQL ==
ANALYZE TABLE mytablename COMPUTE STATISTICS FOR ALL COLUMNS
-------------------------------------------------^^^

In the last example (4) I try to calculate statistics for a column that does not exist in the table. Here I don't receive any error message, although I would expect one.

What is the best practice for calculating table statistics with Scala 2.11 and Spark 2.4, for all columns or at least for some of them?

Support for FOR ALL COLUMNS was only added in Spark 3.0. Before Spark 3.0 you need to list the names of the columns for which you want to compute statistics. Your example 1 should work, and if you want to see the computed statistics you can run the following (for column-level statistics):

DESCRIBE EXTENDED table_name table_col

or just (for table level stats)

DESCRIBE EXTENDED table_name

In the table-level output there is a row whose col_name is Statistics, which holds the relevant information. If you still don't see it for some reason, refreshing the table might also help:

REFRESH TABLE table_name
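
On Spark 2.4, where FOR ALL COLUMNS is not available, the explicit column list can be generated from the table schema instead of being typed by hand. The following is only a minimal sketch under stated assumptions, not a definitive implementation. The names AnalyzeAllColumns, analyzeColumnsSql, supportsColumnStats and analyzeAllColumns are hypothetical helpers introduced for illustration. The type filter is an assumption about which column types accept statistics, so adjust it if your tables behave differently. It also assumes that table and column names are plain identifiers that need no backtick quoting.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object AnalyzeAllColumns {

  // Build a statement that names every column explicitly,
  // which is the form accepted before Spark 3.0.
  def analyzeColumnsSql(table: String, columns: Seq[String]): String = {
    require(columns.nonEmpty, "need at least one column to analyze")
    s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS ${columns.mkString(", ")}"
  }

  // Assumption: only simple (non-nested) types accept column statistics,
  // so arrays, maps and structs are skipped here.
  def supportsColumnStats(dataType: DataType): Boolean = dataType match {
    case _: NumericType | StringType | BinaryType | BooleanType | DateType | TimestampType => true
    case _ => false
  }

  // Hypothetical helper: analyze every eligible column, then print the stats.
  def analyzeAllColumns(spark: SparkSession, table: String): Unit = {
    val columns = spark.table(table).schema.fields
      .filter(f => supportsColumnStats(f.dataType))
      .map(_.name)
      .toSeq

    spark.sql(analyzeColumnsSql(table, columns))

    // Table-level statistics appear in the "Statistics" row of this output.
    spark.sql(s"DESCRIBE EXTENDED $table").show(100, false)

    // Column-level statistics, one DESCRIBE per analyzed column.
    columns.foreach { c =>
      spark.sql(s"DESCRIBE EXTENDED $table $c").show(false)
    }
  }
}
```

In a spark-shell session this could be called as AnalyzeAllColumns.analyzeAllColumns(spark, "mytablename"). The statement builder is kept separate from the Spark calls so that the generated SQL can be inspected or logged before it is executed.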
