
Unable to select count of rows of an ORC table through Hive Beeline command

I am using the following components: Hadoop 3.1.4, Hive 3.1.3, and Tez 0.9.2. There is an ORC table from which I am trying to extract the row count. Running select count(*) from ORC_TABLE throws the set of exceptions below:

Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1670915386694_0182_1_00, diagnostics=[Vertex vertex_1670915386694_0182_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: jio_ar_consumer_events initializer failed, vertex=vertex_1670915386694_0182_1_00 [Map 1], java.lang.RuntimeException: ORC split generation failed with exception: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:519)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:765)
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)

Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1790)
    ... 17 more
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
    at org.apache.hadoop.hive.ql.io.AcidUtils.lambda$getAcidState$0(AcidUtils.java:1117)
    at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
    at java.util.TimSort.sort(TimSort.java:220)
    at java.util.Arrays.sort(Arrays.java:1512)
    at java.util.ArrayList.sort(ArrayList.java:1464)
    at java.util.Collections.sort(Collections.java:177)
    at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1115)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.callInternal(OrcInputFormat.java:1207)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.access$1500(OrcInputFormat.java:1142)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:1179)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:1176)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:1176)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:1142)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    ... 3 more


There is another article where the same problem is described, ORC Split Generation issue with Hive Table, but there isn't any solution there yet. I also tried running the CONCATENATE function on the ORC table, but that didn't help either.

What does work, though, is select * from ORC_TABLE, with or without LIMIT; that extracts the records fine. I reckon the issue must be limited to aggregate functions, or maybe I don't fully understand it yet.

I am also using Spark 3.3.1, and through the Spark SQL utility on a Spark context I can extract the same count and fetch the rows as well. No issues with Spark on that front.

Adding on to that, when I change the execution engine to MR, the query works. It fails only when I run it on the Tez engine.
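The MR fallback described above can be applied per session from Beeline; a minimal sketch, where the JDBC URL is a placeholder for your own HiveServer2 endpoint:

```shell
# Fall back to the MR engine for this session only, then run the count.
# hive.execution.engine is a standard Hive setting; the URL is illustrative.
beeline -u "jdbc:hive2://localhost:10000/default" \
  -e "SET hive.execution.engine=mr; SELECT COUNT(*) FROM ORC_TABLE;"
```

This only sidesteps the Tez failure; the jar fix below addresses the root cause.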

Any leads to resolve this issue are much appreciated.

The issue was resolved by the steps below, based on my earlier analysis:

The class org.apache.hadoop.fs.FileStatus ships as part of the hadoop-common jar.

We were using Hadoop 3.1.4 and Tez 0.9.2.

Tez 0.9.2 ships a tez.tar.gz that needs to be placed at an HDFS location. This tez.tar.gz contained hadoop-common-2.7.2.jar, which does not have the compareTo(FileStatus) method whose absence triggers the NoSuchMethodError shown in the stack trace.
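One way to confirm which hadoop-common version is bundled is to pull the archive down and list its jars; the HDFS path below is the one from this setup and may differ on your cluster:

```shell
# Fetch the Tez archive Hive/Tez actually uses (path from this setup).
hdfs dfs -get /user/tez/share/tez.tar.gz .

# List the bundled hadoop-common jar; seeing hadoop-common-2.7.2.jar
# alongside a Hadoop 3.1.4 cluster confirms the version mismatch.
tar -tzf tez.tar.gz | grep 'hadoop-common'
```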

Solution 1:

We extracted tez.tar.gz and replaced all Hadoop 2.7.2 jars with their Hadoop 3.1.4 counterparts. Do this if you don't want to reconfigure everything against a new Tez version; otherwise, follow Solution 2 below.

We then recreated the tar and placed it across all dependent locations, including HDFS. For us that was /user/tez/share/tez.tar.gz; the path varies with your configuration.
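The repack steps above can be sketched as shell commands. This is only a sketch: the directory layout inside the archive, the exact jar names, $HADOOP_HOME, and the HDFS path all follow this particular setup and may differ on yours.

```shell
set -e

# Unpack the Tez archive into a working directory.
mkdir tez-repack && tar -xzf tez.tar.gz -C tez-repack

# Remove the bundled Hadoop 2.7.2 jars and copy in the 3.1.4 jars
# from the local Hadoop install (layout and names are illustrative).
rm tez-repack/lib/hadoop-*-2.7.2.jar
cp "$HADOOP_HOME"/share/hadoop/common/hadoop-common-3.1.4.jar tez-repack/lib/

# Recreate the archive and push it back to the HDFS location Tez reads from.
tar -czf tez.tar.gz -C tez-repack .
hdfs dfs -put -f tez.tar.gz /user/tez/share/tez.tar.gz
```

After replacing the archive, restart the affected Hive sessions so Tez picks up the new tarball.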

The error disappeared after I followed these steps, and I am now able to count records on any table.

Solution 2: Alternatively, use a 0.10.x Tez release, which ships with Hadoop 3.x libraries, rather than Tez 0.9.2, which is built against Hadoop 2.7.x.
