
Difference in behaviour while running "count(*)" in Tez and MapReduce

Recently I came across this issue. I had a file at a Hadoop Distributed File System (HDFS) path and a related Hive table. The table had 30 partitions on both sides.

I deleted 5 partitions from HDFS and then executed "msck repair table <db.tablename>;" on the Hive table. It completed fine but output:

"Partitions missing from filesystem:"

I then tried running "select count(*) from <db.tablename>;" (on Tez), and it failed with the following error:

Caused by: java.util.concurrent.ExecutionException: java.io.FileNotFoundException:

But when I set hive.execution.engine to "mr" and executed "select count(*) from <db.tablename>;" it worked fine without any issue.

I have two questions now:

  1. How is this possible?

  2. How can I sync the Hive metastore with the HDFS partitions in the above case? (My Hive version is "Hive 1.2.1000.2.6.5.0-292".)

Thanks in advance for help.

MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];

This will update metadata about partitions in the Hive metastore for partitions for which such metadata doesn't already exist. The default option for the MSCK command is ADD PARTITIONS. With this option, it will add to the metastore any partitions that exist on HDFS but not in the metastore. The DROP PARTITIONS option will remove from the metastore the partition information for partitions that have already been removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS.

However, these options are available only from Hive version 3.0. See HIVE-17824.
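The three options described above can be read as set operations over two partition lists: the one recorded in the metastore and the one present on HDFS. The following is a minimal Python sketch of that reading, not Hive's implementation; the function name `msck_plan` and the partition strings are hypothetical and used only for illustration.

```python
def msck_plan(metastore_partitions, hdfs_partitions, option="ADD"):
    """Return (to_add, to_drop) for an MSCK-style option.

    Illustrative model only: this mimics the documented behaviour of
    ADD / DROP / SYNC PARTITIONS, it does not call Hive.
    """
    metastore = set(metastore_partitions)
    hdfs = set(hdfs_partitions)
    option = option.upper()
    if option not in {"ADD", "DROP", "SYNC"}:
        raise ValueError(f"unknown option: {option}")

    # ADD: partitions on HDFS that the metastore does not know about.
    to_add = hdfs - metastore if option in {"ADD", "SYNC"} else set()
    # DROP: partitions in the metastore whose directories are gone from HDFS.
    to_drop = metastore - hdfs if option in {"DROP", "SYNC"} else set()
    return to_add, to_drop


# Hypothetical example: one partition deleted from HDFS, one added by hand.
metastore = {"dt=2020-01-01", "dt=2020-01-02"}
hdfs = {"dt=2020-01-02", "dt=2020-01-03"}
print(msck_plan(metastore, hdfs, "ADD"))
print(msck_plan(metastore, hdfs, "SYNC"))
```

With the default ADD option, the `to_drop` set is always empty, which matches the situation in the question: the stale partitions stay in the metastore after the repair.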

In your case, the version is Hive 1.2. Below are the steps to sync the HDFS partitions and the table partitions in the metastore.

  1. Drop the 5 partitions that you removed directly from HDFS, using the ALTER statement below.

ALTER TABLE <db.table_name> DROP PARTITION (<partition_column=value>);

  2. Run SHOW PARTITIONS <table_name>; and check that the list of partitions is refreshed.

This should bring the partitions in the Hive metastore (HMS) in line with those in HDFS.
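When several partitions are affected, the ALTER statements from the steps above can be generated rather than typed by hand. The sketch below is one possible way to do that in Python, under stated assumptions: the helper name `drop_statements` is hypothetical, the partition strings are in the `col=value/col2=value2` form that SHOW PARTITIONS prints, every value is quoted as a string, and values contain no quotes or escaped characters. It only builds the statements; running them in Hive is left to you.

```python
def drop_statements(table, metastore_partitions, hdfs_partitions):
    """Build ALTER TABLE ... DROP PARTITION statements for partitions
    that are listed in the metastore but no longer present on HDFS.

    Assumes partition specs look like 'dt=2020-01-01/country=US' and
    that quoting every value as a string is acceptable for the table.
    """
    stale = sorted(set(metastore_partitions) - set(hdfs_partitions))
    statements = []
    for spec in stale:
        pairs = []
        for key_value in spec.split("/"):
            column, _, value = key_value.partition("=")
            pairs.append(f"{column}='{value}'")
        statements.append(
            f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({', '.join(pairs)});"
        )
    return statements


# Hypothetical table and partition names, for illustration only.
in_metastore = ["dt=2020-01-01", "dt=2020-01-02", "dt=2020-01-03"]
on_hdfs = ["dt=2020-01-01"]
for statement in drop_statements("db.table_name", in_metastore, on_hdfs):
    print(statement)
```

The first list would come from the SHOW PARTITIONS output and the second from a listing of the table directory on HDFS. Reviewing the generated statements before running them is a sensible precaution.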

Alternatively, if it is an EXTERNAL table, you can drop and recreate the table and then run MSCK REPAIR on the newly created table, because dropping an external table does not delete the underlying data.

Note: By default, MSCK REPAIR only adds partitions newly created in HDFS to the Hive metastore. It does not remove from the Hive metastore the partitions that have been deleted manually in HDFS.

====

To avoid these steps in future, it is better to delete partitions from Hive using ALTER TABLE <table_name> DROP PARTITION (<partition_column=value>) rather than removing them directly from HDFS.


 