
Hive ALTER TABLE CONCATENATE command risks

I have been using the Tez engine to run MapReduce jobs. I have an MR job that takes ages to run: I noticed I have over 20k files with 1 stripe each, and Tez does not distribute mappers evenly based on the number of files, but rather on the number of stripes. So I can end up with a bunch of mappers each handling 1 file with a lot of stripes, and some mappers processing 15k files with the same number of stripes as the others.

As a workaround test, I used ALTER TABLE table PARTITION (...) CONCATENATE to reduce the number of files to process, leaving stripes more evenly distributed across files, and now the map job runs perfectly fine.

My concern is that I couldn't find anything in the documentation about whether running this command carries a risk of losing data, since it works on the same files.

I'm trying to assess whether it is better to use concatenate to bring down the number of files before the MR job, or to use bucketing, which reads the files and writes the bucketed output to a separate location, so that in case of failure I don't lose the source data.

Concatenate takes 1 minute per partition, whereas bucketing takes more time but does not risk losing the source data.

My question: is there any risk of data loss when running the concatenate command?

Thanks!

It should be as safe as rewriting the table from a query. It uses the same mechanism: the result is prepared in staging first, and then the staging directory is moved to the table or partition location.

Concatenation runs as a separate MR job: it prepares the concatenated files in a staging directory and, only if everything completed without errors, moves them to the table location. You should see something like this in the logs:

INFO  : Loading data to table dbname.tblName partition (bla bla) from /apps/hive/warehouse/dbname.db/tblName/bla bla partition path/.hive-staging_hive_2018-08-16_21-28-01_294_168641035365555493-149145/-ext-10000
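To make the "prepare in staging, then move" idea concrete, here is a minimal sketch of the same pattern on a local filesystem. It is not Hive's implementation: the function name `concatenate_via_staging`, the `.staging_` prefix, and the output name `merged_000000` are all hypothetical, and the plain byte-level concatenation is only a stand-in for the real merge of ORC files.

```python
import os
import shutil
import tempfile


def concatenate_via_staging(partition_dir: str) -> str:
    """Merge every file in partition_dir into one file, going through staging.

    Illustrative only: names and the byte-level merge are assumptions,
    not what Hive actually does with ORC stripes.
    """
    # Snapshot the source files before creating the staging directory.
    sources = sorted(
        name for name in os.listdir(partition_dir)
        if os.path.isfile(os.path.join(partition_dir, name))
    )

    # Stage the merged output inside the partition directory.
    staging = tempfile.mkdtemp(prefix=".staging_", dir=partition_dir)
    staged_file = os.path.join(staging, "merged_000000")
    try:
        with open(staged_file, "wb") as out:
            for name in sources:
                with open(os.path.join(partition_dir, name), "rb") as src:
                    shutil.copyfileobj(src, out)
    except Exception:
        # On any failure, discard staging; the source files are untouched.
        shutil.rmtree(staging, ignore_errors=True)
        raise

    # Only after the merge succeeded: publish the result, then drop the inputs.
    final_path = os.path.join(partition_dir, "merged_000000")
    os.replace(staged_file, final_path)
    for name in sources:
        path = os.path.join(partition_dir, name)
        if path != final_path:
            os.remove(path)
    os.rmdir(staging)
    return final_path
```

The point of the sketch is the ordering: the original files are only replaced after the merged output has been written completely, so a failure during the merge leaves the source data as it was.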
