How to update partition metadata in Hive when partition data is manually deleted from HDFS

Is there a way to automatically update the metadata of Hive partitioned tables?

If new partition data is added to HDFS (without running an ALTER TABLE ADD PARTITION command), we can sync the metadata by running MSCK REPAIR.

What should be done if a lot of partition data was deleted from HDFS (without running an ALTER TABLE DROP PARTITION command)?

How can the Hive metadata be brought back in sync?

EDIT: Starting with Hive 3.0.0, MSCK can discover new partitions, remove missing partitions, or both, using the following syntax:

MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS]

This was implemented in HIVE-17824.
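As a minimal sketch of the three variants, assuming Hive 3.0.0 or later and a hypothetical partitioned table named `sales` (the table name is not from the original post):

```sql
-- Assumption: Hive 3.0.0+; `sales` is a hypothetical partitioned table.

-- Register partitions whose directories exist on HDFS
-- but are missing from the metastore.
MSCK REPAIR TABLE sales ADD PARTITIONS;

-- Remove partitions from the metastore whose directories
-- no longer exist on HDFS.
MSCK REPAIR TABLE sales DROP PARTITIONS;

-- Do both in one pass.
MSCK REPAIR TABLE sales SYNC PARTITIONS;
```

For the case in the question (folders deleted by hand), the DROP PARTITIONS or SYNC PARTITIONS form is the relevant one.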


As correctly stated by HakkiBuyukcengiz, MSCK REPAIR doesn't remove partitions if the corresponding folder on HDFS was manually deleted; it only adds partitions when new folders are created.

Extract from the official documentation:

In other words, it will add any partitions that exist on HDFS but not in metastore to the metastore.

This is what I usually do with external tables when multiple partition folders have been manually deleted on HDFS and I want to quickly refresh the partitions:

  • Drop the table ( DROP TABLE table_name ). Dropping an external table does not delete the underlying partition files.
  • Recreate the table ( CREATE EXTERNAL TABLE table_name ... )
  • Repair it ( MSCK REPAIR TABLE table_name )
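The three steps above can be sketched as follows. The table name `logs`, its columns, the partition column `dt`, the storage format, and the location are all hypothetical placeholders; in practice you would reuse the table's real DDL (for example, as printed by SHOW CREATE TABLE before dropping it).

```sql
-- Assumption: `logs` is an EXTERNAL table, so dropping it leaves
-- the files on HDFS untouched. All names below are hypothetical.

-- Save the existing definition before dropping anything.
SHOW CREATE TABLE logs;

-- 1. Drop the table (metadata only for an external table).
DROP TABLE logs;

-- 2. Recreate it with the same schema and location.
CREATE EXTERNAL TABLE logs (
  id      BIGINT,
  message STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION '/data/logs';

-- 3. Re-register only the partition folders that still exist.
MSCK REPAIR TABLE logs;

-- Check the result.
SHOW PARTITIONS logs;
```

Because the recreated table starts with no partitions, the repair step registers only the folders still present on HDFS, which is why the deleted partitions disappear from the metastore.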

Depending on the number of partitions, this can take a long time. The other solution is to use ALTER TABLE DROP PARTITION (...) for each deleted partition folder, but this can be tedious if many partitions were deleted.
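A short sketch of the ALTER TABLE approach, again with a hypothetical table `logs` partitioned by a string column `dt`. Several partitions can be listed in one statement, and a comparison operator in the partition spec can cover a whole range, which reduces the tedium somewhat:

```sql
-- Assumption: `logs` and `dt` are hypothetical names; on a managed
-- (non-external) table, dropping a partition also deletes its data.

-- Drop two specific partitions in one statement.
ALTER TABLE logs DROP IF EXISTS
  PARTITION (dt = '2019-01-01'),
  PARTITION (dt = '2019-01-02');

-- Drop every partition older than a given date.
ALTER TABLE logs DROP IF EXISTS PARTITION (dt < '2019-01-01');
```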

Try using

MSCK REPAIR TABLE <tablename>;

Ensure the table is set to external, drop all partitions, then run the table repair:

alter table mytable_name set TBLPROPERTIES('EXTERNAL'='TRUE');
alter table mytable_name drop if exists partition (`mypart_name` <> 'null');
msck repair table mytable_name;


If msck repair throws an error, then run hive from the terminal as:
hive --hiveconf hive.msck.path.validation=ignore
or, inside the Hive session, run:
set hive.msck.path.validation=ignore;
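For context, `hive.msck.path.validation` controls how MSCK treats partition directories whose names fail validation: the default `throw` raises an error, `skip` leaves those directories out, and `ignore` turns the check off. A minimal session-level sketch, reusing the `mytable_name` placeholder from above:

```sql
-- Assumption: the MSCK error comes from directory-name validation;
-- other failures will not be fixed by this setting.
SET hive.msck.path.validation=ignore;

-- Retry the repair within the same session.
MSCK REPAIR TABLE mytable_name;
```

Using `skip` instead of `ignore` may be the safer choice if you would rather leave badly named directories unregistered.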
