
Spark - Get Counts while saving into hive table (ORC)

Is there any way to get the row count of a DataFrame that I am inserting into a Hive table using saveAsTable(), without a performance penalty?

Honestly, I would like to log the counts, or ideally get the counts both before and after the insert, as that would be really useful information in a Splunk dashboard. But I don't want to add Hive queries that might hurt performance significantly, since I have more than 100 transformations.
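For context, here is the naive pattern I want to avoid (a sketch; df and db.target_table are placeholders), since each count() is a separate action that re-runs the whole lineage:

// Naive approach: each count() triggers an extra Spark job that
// re-evaluates all upstream transformations (unless the DataFrame is cached).
val countBefore = df.count()
df.write.format("orc").mode("append").saveAsTable("db.target_table")
val countAfter = spark.table("db.target_table").count() // full table scan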

Thanks in advance for your help!

You can rely on Hive statistics instead of running an extra count. For newly created tables and/or partitions (populated through the INSERT OVERWRITE command), statistics are automatically computed by default, so the row count is already stored in the Hive metastore. The user has to explicitly set the boolean variable hive.stats.autogather to false so that statistics are not automatically computed and stored into the Hive metastore:

set hive.stats.autogather=false;
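If you want to control this setting from Spark rather than from a Hive session, a minimal sketch (assuming a Hive-enabled SparkSession named spark):

// Leave autogather on (the default) so INSERT OVERWRITE populates
// numRows in the metastore; set it to false only to disable the behaviour.
spark.sql("SET hive.stats.autogather=false")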

Table Level Statistics:

spark.sql("ANALYZE TABLE tableName COMPUTE STATISTICS").show()

which results in

parameters: {totalSize=0, numRows=0, rawDataSize=0, ...}
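Once the statistics exist, you can read them back from the metastore without scanning the data. A sketch (tableName is a placeholder) that pulls the Statistics row out of DESCRIBE EXTENDED:

// The 'Statistics' row of DESCRIBE EXTENDED holds the stored stats,
// e.g. "1234 bytes, 56 rows"; reading it does not touch the table data.
val statsRow = spark.sql("DESCRIBE EXTENDED tableName")
  .filter("col_name = 'Statistics'")
  .select("data_type")
  .collect()
statsRow.foreach(r => println(s"Table statistics: ${r.getString(0)}"))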

Table Partition Level Statistics:

spark.sql("ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS").show()

Note: When the user issues that command, he may or may not specify the partition specs. If the user doesn't specify any partition specs, statistics are gathered for the table as well as all the partitions (if any).
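For example, with explicit partition values (ds and hr are the placeholder partition columns from the Hive docs), statistics are gathered for that partition only:

// Analyze a single partition; the values here are illustrative.
spark.sql("ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS")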

Table Column Level Statistics:

spark.sql("ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS FOR COLUMNS").show()

You can find more details at: https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ExistingTables%E2%80%93ANALYZE
