Optimize Hive Query. java.lang.OutOfMemoryError: Java heap space/GC overhead limit exceeded
How can I optimize a query of this form, given that I keep running into this OOM error? Or come up with a better execution plan? If I remove the substring clause, the query works fine, which suggests that the substring takes a lot of memory.
When the job fails, the beeline output shows the OOM Java heap space error. Readings online suggested that I increase
export HADOOP_HEAPSIZE
but this still results in the error. Another thing I tried was increasing hive.tez.container.size and hive.tez.java.opts (the Tez heap), but the error remains. In the YARN logs there is "GC overhead limit exceeded", which suggests a combination of not enough memory and/or an extremely inefficient query plan, since the JVM cannot reclaim enough memory.
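To be concrete, this is roughly what I set; the values below are placeholders rather than the exact numbers from my attempts (hive.tez.java.opts is conventionally set to about 80% of the container size):
-- session-level settings tried in beeline; values are illustrative only
SET hive.tez.container.size=10240;    -- Tez container size in MB
SET hive.tez.java.opts=-Xmx8192m;     -- Tez task JVM heap, roughly 80% of the container
and, before launching beeline:
# client-side heap for the hadoop/beeline launcher, in MB
export HADOOP_HEAPSIZE=4096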
I am using Azure HDInsight Interactive Query 4.0, with 20 worker nodes (D13v2: 8 cores and 56 GB RAM each).
create external table database.sourcetable(
a,
b,
c,
...
(183 total columns)
...
)
PARTITIONED BY (
W string,
X int,
Y string,
Z int
)
create external table database.NEWTABLE(
a,
b,
c,
...
(187 total columns)
...
W,
X,
Y,
Z
)
PARTITIONED BY (
aAAA,
bBBB,
cCCC
)
insert overwrite table database.NEWTABLE partition(aAAA, bBBB, cCCC)
select
a,
b,
c,
...
(187 total columns)
...
W,
X,
Y,
Z,
cast(a as string) as aAAA,
from_unixtime(unix_timestamp(b,'yyMMdd'),'yyyyMMdd') as bBBB,
substring(upper(c),1,2) as cCCC
from database.sourcetable
If everything else is okay, try adding distribute by the partition key at the end of your query:
from database.sourcetable
distribute by aAAA, bBBB, cCCC
As a result, each reducer will create only one partition file, consuming less memory.
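Putting it together with the query from the question, the full statement would look like this (column list elided exactly as above):
insert overwrite table database.NEWTABLE partition(aAAA, bBBB, cCCC)
select
a,
b,
c,
...
(187 total columns)
...
W,
X,
Y,
Z,
cast(a as string) as aAAA,
from_unixtime(unix_timestamp(b,'yyMMdd'),'yyyyMMdd') as bBBB,
substring(upper(c),1,2) as cCCC
from database.sourcetable
distribute by aAAA, bBBB, cCCC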
Try sorting by the partition columns:
SET hive.optimize.sort.dynamic.partition=true;
When enabled, the dynamic partitioning columns will be globally sorted. This way we can keep only one record writer open for each partition value in the reducer, thereby reducing the memory pressure on reducers.
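A minimal session sketch combining both suggestions; the two hive.exec.dynamic.partition settings are assumptions about your session defaults, since an insert with all partition columns dynamic requires them:
-- assumed prerequisites for a fully dynamic partition insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- keep only one record writer open per partition value in each reducer
SET hive.optimize.sort.dynamic.partition=true;
-- then run: insert overwrite table database.NEWTABLE partition(aAAA, bBBB, cCCC) ... distribute by aAAA, bBBB, cCCC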
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties