Optimize Hive Query. java.lang.OutOfMemoryError: Java heap space/GC overhead limit exceeded
How can I optimize a query of this form, given that I keep running into this OOM error? Or come up with a better execution plan? If I remove the substring clause, the query works fine, which suggests that the substring takes a lot of memory.
When the job fails, the beeline output shows the OOM Java heap space error. Readings online suggested that I increase
export HADOOP_HEAPSIZE
but this still results in the error. Another thing I tried was increasing hive.tez.container.size and hive.tez.java.opts (the Tez heap), but the error remains. In the YARN logs there is "GC overhead limit exceeded", which suggests a combination of not enough memory and/or an extremely inefficient query plan, since the JVM cannot reclaim enough memory.
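To be concrete, this is roughly what I set; the values below are placeholders rather than the exact numbers from my attempts (hive.tez.java.opts is conventionally set to about 80% of the container size):
-- session-level settings tried in beeline; values are illustrative only
SET hive.tez.container.size=10240;    -- Tez container size in MB
SET hive.tez.java.opts=-Xmx8192m;     -- Tez task JVM heap, roughly 80% of the container
and, before launching beeline:
# client-side heap for the hadoop/beeline launcher, in MB
export HADOOP_HEAPSIZE=4096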
I am using Azure HDInsight Interactive Query 4.0, with 20 worker nodes (D13v2: 8 cores and 56 GB RAM each).
create external table database.sourcetable(
a,
b,
c,
...
(183 total columns)
...
)
PARTITIONED BY (
W string,
X int,
Y string,
Z int
)
create external table database.NEWTABLE(
a,
b,
c,
...
(187 total columns)
...
W,
X,
Y,
Z
)
PARTITIONED BY (
aAAA,
bBBB,
cCCC
)
insert overwrite table database.NEWTABLE partition(aAAA, bBBB, cCCC)
select
a,
b,
c,
...
(187 total columns)
...
W,
X,
Y,
Z,
cast(a as string) as aAAA,
from_unixtime(unix_timestamp(b,'yyMMdd'),'yyyyMMdd') as bBBB,
substring(upper(c),1,2) as cCCC
from database.sourcetable
If everything else is okay, try adding distribute by the partition key at the end of your query:
from database.sourcetable
distribute by aAAA, bBBB, cCCC
As a result, each reducer will create only one partition file, consuming less memory.
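Putting it together with the query from the question, the full statement would look like this (column list elided exactly as above):
insert overwrite table database.NEWTABLE partition(aAAA, bBBB, cCCC)
select
a,
b,
c,
...
(187 total columns)
...
W,
X,
Y,
Z,
cast(a as string) as aAAA,
from_unixtime(unix_timestamp(b,'yyMMdd'),'yyyyMMdd') as bBBB,
substring(upper(c),1,2) as cCCC
from database.sourcetable
distribute by aAAA, bBBB, cCCC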
Try sorting by the partition columns:
SET hive.optimize.sort.dynamic.partition=true;
When enabled, the dynamic partitioning columns will be globally sorted. This way we can keep only one record writer open for each partition value in the reducer, thereby reducing the memory pressure on reducers.
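A minimal session sketch combining both suggestions; the two hive.exec.dynamic.partition settings are assumptions about your session defaults, since an insert with all partition columns dynamic requires them:
-- assumed prerequisites for a fully dynamic partition insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- keep only one record writer open per partition value in each reducer
SET hive.optimize.sort.dynamic.partition=true;
-- then run: insert overwrite table database.NEWTABLE partition(aAAA, bBBB, cCCC) ... distribute by aAAA, bBBB, cCCC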
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties