
How does CONCATENATE in the Hive ALTER TABLE command work?

I am trying to understand how exactly ALTER TABLE ... CONCATENATE in Hive works.

I saw this link: How does Hive 'alter table <table name> concatenate' work? but all I got from it is that for ORC files, the merge happens at the stripe level.

I am looking for a detailed explanation of how CONCATENATE works. As an example, I initially had 500 small ORC files in HDFS. I ran the Hive ALTER TABLE CONCATENATE and the files merged into 27 bigger files. Subsequent runs of CONCATENATE reduced the number of files to 16, and finally I ended up with two large files (using Hive 0.12). So I wanted to understand:

  1. How exactly does CONCATENATE work? Does it look at the existing number of files as well as their sizes? How does it determine the number of output ORC files after concatenation?

  2. Are there any known issues with using CONCATENATE? We are planning to run it once a day in the maintenance window.

  3. Is using CTAS an alternative to CONCATENATE, and which is better? Note that my requirement is to reduce the number of ORC files (ingested through NiFi) without compromising read performance.

Any help is appreciated; thanks in advance.
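For reference, the statement being discussed is Hive's file-merge command, which runs either on a whole (unpartitioned) table or on one partition at a time; the table and partition names below are placeholders, not the asker's actual schema:

-- merge the small ORC files of an unpartitioned table
ALTER TABLE my_orc_table CONCATENATE;

-- for a partitioned table, the partition must be specified
ALTER TABLE my_orc_table PARTITION (dt='2021-01-01') CONCATENATE;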

Concatenated file size can be controlled with the following two values:

set mapreduce.input.fileinputformat.split.minsize=268435456;
set hive.exec.orc.default.block.size=268435456;

These values should be set based on your HDFS/MapR-FS block size.
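A minimal sketch of how these settings are typically combined with the command, assuming a 256 MB target block size (the table name is hypothetical):

-- 256 MB minimum input split: avoid producing splits smaller than this
set mapreduce.input.fileinputformat.split.minsize=268435456;
-- 256 MB ORC block size for the merged output files
set hive.exec.orc.default.block.size=268435456;
ALTER TABLE my_orc_table PARTITION (dt='2021-01-01') CONCATENATE;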

As @leftjoin commented, it is indeed the case that you can get different output files for the same underlying data.

This is discussed more in the linked HCC thread, but the key point is:

Concatenation depends on which files are chosen first.

Note that having files of different sizes should not be a problem in normal situations.

If you want to streamline your process, then depending on how big your data is, you may also want to batch it a bit before writing to HDFS, for instance by setting the batch size in NiFi.
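As for the CTAS alternative raised in question 3, a sketch of that approach is below; the table names are hypothetical, and the hive.merge.* options shown are the standard Hive settings for merging small output files of a rewrite job (use hive.merge.tezfiles instead of hive.merge.mapredfiles when running on Tez):

-- rewrite the data through a single job so Hive controls the output file count
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=268435456;
set hive.merge.size.per.task=268435456;
CREATE TABLE my_orc_table_compacted STORED AS ORC
AS SELECT * FROM my_orc_table;

Unlike CONCATENATE, this rewrites every row rather than merging ORC stripes in place, so it is more expensive but produces freshly written, uniformly sized files.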
