
Results of rdd.count and count via Spark SQL are the same, but they differ from the count with Hive SQL

I use count to get the number of rows in an RDD and got 13673153, but after I converted the RDD to a DataFrame, inserted it into Hive, and counted again, I got 13673182. Why?

  1. rdd.count
  2. spark.sql("select count(*) from ...").show()
  3. Hive SQL: select count(*) from ...

This could be caused by a mismatch between the data in the underlying files and the metadata registered in Hive for that table. Try running:

MSCK REPAIR TABLE tablename;

in Hive, and see if the issue is fixed. The command updates the partition information of the table. You can find more info in the Hive documentation.
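For a partitioned table, a minimal repair-and-recount sequence in the Hive shell might look like this (mydb.tablename is a placeholder):

```sql
-- Re-register any partition directories on disk that the metastore does not know about
MSCK REPAIR TABLE mydb.tablename;

-- Count again; Hive and Spark SQL should now agree
SELECT COUNT(*) FROM mydb.tablename;
```

If Spark has already cached the table's file listing in the same session, calling `spark.catalog.refreshTable("mydb.tablename")` invalidates that cache before re-running the Spark-side count.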

During a Spark Action, as part of the SparkContext, Spark records which files were in scope for processing. So, if the DAG needs to recover and reprocess that Action, the same results are obtained. By design.

Hive QL has no such considerations.

UPDATE

As you noted, the other answer did not help in this use case.

So, when Spark processes Hive tables, it looks at the list of files that it will use for the Action.

In the case of a failure (node failure, etc.), it will recompute data from the generated DAG. If it needs to go back and re-compute as far as the start of reading from Hive itself, it will know which files to use, i.e. the same files, so that the same results are obtained instead of non-deterministic outcomes. Think of partitioning aspects, for example: it is handy that the same results can be recomputed!
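A toy Python sketch (not Spark code, just an illustration of the behaviour described above): Spark pins the list of input files when the action starts and reuses it on recomputation, while a Hive query simply counts whatever files exist at query time. If another writer adds a file in between, the two counts diverge.

```python
import os
import tempfile

def count_rows(paths):
    """Count lines across the given files (a stand-in for count())."""
    total = 0
    for p in paths:
        with open(p) as f:
            total += sum(1 for _ in f)
    return total

with tempfile.TemporaryDirectory() as d:
    # Two "partition" files present when the Spark action begins.
    for name, rows in [("part-0", 3), ("part-1", 2)]:
        with open(os.path.join(d, name), "w") as f:
            f.write("x\n" * rows)

    # Spark-style: snapshot the file list up front and reuse it,
    # even if the action must be recomputed after a failure.
    snapshot = sorted(os.path.join(d, n) for n in os.listdir(d))
    first_run = count_rows(snapshot)

    # Meanwhile another writer adds a file (e.g. a late insert).
    with open(os.path.join(d, "part-2"), "w") as f:
        f.write("x\n" * 4)

    recomputed = count_rows(snapshot)  # same pinned list -> same answer
    hive_style = count_rows(
        sorted(os.path.join(d, n) for n in os.listdir(d))  # fresh listing
    )

print(first_run, recomputed, hive_style)  # 5 5 9
```

The pinned snapshot gives 5 both times, deterministically, while the fresh listing sees the extra file and returns 9; this mirrors how the RDD count and the later Hive count can differ.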

It's that simple. It's by design. Hope this helps.

