
Optimize Hive Query. java.lang.OutOfMemoryError: Java heap space/GC overhead limit exceeded

How can I optimize a query of this form, since I keep running into this OOM error? Or come up with a better execution plan? If I removed the substring clause, the query would work fine, suggesting that this step takes a lot of memory.

When the job fails, the beeline output shows the OOM Java heap space error. Readings online suggested that I increase export HADOOP_HEAPSIZE, but this still results in the error. Another thing I tried was increasing hive.tez.container.size and hive.tez.java.opts (Tez heap), but the error persists. In the YARN logs there would be GC overhead limit exceeded, suggesting a combination of not enough memory and/or an extremely inefficient query plan, since GC can't reclaim enough memory.
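For reference, these knobs are set per session in Beeline. A minimal sketch, with illustrative values (not the exact ones used here) and following the common rule of thumb that the Tez task heap is roughly 80% of the container size:

SET hive.tez.container.size=10240;   -- container size in MB (illustrative value)
SET hive.tez.java.opts=-Xmx8192m;    -- Tez task JVM heap, ~80% of the container size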

I am using Azure HDInsight Interactive Query 4.0, with 20 worker nodes (D13v2: 8 cores, 56 GB RAM each).

Source table

create external table database.sourcetable(
  a,
  b,
  c,
  ...
  (183 total columns)
  ...
)
PARTITIONED BY ( 
  W string, 
  X int, 
  Y string, 
  Z int
)

Target table

create external table database.NEWTABLE(
  a,
  b,
  c,
  ...
  (187 total columns)
  ...
  W,
  X,
  Y,
  Z
)
PARTITIONED BY (
  aAAA,
  bBBB,
  cCCC
)

Query

insert overwrite table database.NEWTABLE partition(aAAA, bBBB, cCCC)
select
a,
b,
c,
...
(187 total columns)
...
W,
X,
Y,
Z,
cast(a as string) as aAAA, 
from_unixtime(unix_timestamp(b,'yyMMdd'),'yyyyMMdd') as bBBB,
substring(upper(c),1,2) as cCCC
from database.sourcetable

If everything else is okay, try to add distribute by partition key at the end of your query:

  from database.sourcetable 
  distribute by aAAA, bBBB, cCCC

As a result, each reducer will create only one partition file, consuming less memory.
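Putting this together with the statement above, the query would end like this (the select list is elided exactly as before):

insert overwrite table database.NEWTABLE partition(aAAA, bBBB, cCCC)
select
...
(same 187 columns and the three partition expressions as above)
...
from database.sourcetable
distribute by aAAA, bBBB, cCCC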

Try sorting the partitioned columns:

SET hive.optimize.sort.dynamic.partition=true;

When enabled, the dynamic partitioning columns will be globally sorted. This way we can keep only one record writer open for each partition value in the reducer, thereby reducing the memory pressure on reducers.
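A minimal sketch of applying this to the insert above; the property only needs to be set for the session:

SET hive.optimize.sort.dynamic.partition=true;

insert overwrite table database.NEWTABLE partition(aAAA, bBBB, cCCC)
select
...   -- same select list as above
from database.sourcetable;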

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
