
Results of rdd.count and count via Spark SQL are the same, but they differ from count with Hive SQL

I use count to calculate the number of rows in the RDD and get 13673153, but after I convert the RDD to a DataFrame, insert it into Hive, and count again, I get 13673182. Why? The three counts (sketched in code after this list) are:

  1. rdd.count
  2. spark.sql("select count(*) from ...").show()
  3. hive sql: select count(*) from ...
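
A minimal sketch of that workflow in Scala (e.g. in the spark-shell, where spark is the SparkSession); the table name, path, and column name below are placeholders, not from the original question:

// Sketch only: assumes a text source and a Hive-enabled SparkSession.
import spark.implicits._

val rdd = spark.sparkContext.textFile("/path/to/source")  // placeholder path
println(rdd.count())                                      // 1. rdd.count

val df = rdd.toDF("value")                                // RDD -> DataFrame
df.write.mode("overwrite").saveAsTable("mydb.mytable")    // insert into Hive (placeholder table)

spark.sql("select count(*) from mydb.mytable").show()     // 2. count via Spark SQL
// 3. then, in Hive: select count(*) from mydb.mytable;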

This could be caused by a mismatch between the data in the underlying files and the metadata registered in Hive for that table. Try running:

MSCK REPAIR TABLE tablename;

in Hive, and see if the issue is fixed. The command updates the partition information of the table; see the Hive documentation on MSCK REPAIR TABLE for more details.
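
If you are working from a Spark session rather than the Hive CLI, the same repair statement can be issued there as well (a sketch; tablename is a placeholder):

// Spark SQL also accepts MSCK REPAIR TABLE for partitioned tables.
spark.sql("MSCK REPAIR TABLE tablename")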

During a Spark Action, and as part of the SparkContext, Spark records which files were in scope for processing. So if the DAG needs to recover and reprocess that Action, the same results are produced. This is by design.

Hive QL has no such considerations.

UPDATE

As you noted, the other answer did not help in this use case.

So, when Spark processes a Hive table, it records the list of files that it will use for the Action.

In the case of a failure (node failure, etc.), it will recompute data from the generated DAG. If it needs to go back and recompute as far as the initial read from Hive itself, it will know which files to use - i.e. the same files - so the same results are produced instead of non-deterministic outcomes. Think of partitioning aspects, for example: it is handy that the same results can be recomputed.

It's that simple. It's by design. Hope this helps.
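
A practical consequence of this design, as a hedged note: if files are added to or removed from the table while a Spark session still holds its cached file listing, Spark SQL can report a stale count until the listing is invalidated, which REFRESH TABLE does (tablename is a placeholder):

// Invalidate Spark's cached metadata and data for the table,
// forcing the file listing to be re-read on the next query.
spark.sql("REFRESH TABLE tablename")

After the refresh, a new count in the same session will re-list the table's files.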
