
Yarn container is running out of memory

My yarn container is running out of memory. This specific container runs an Apache Spark driver node.

The part I don't understand: I am limiting my driver's heap size to 512MB (you can see this in the error message below), but the yarn container is complaining about memory > 1GB (also see the message below). You can verify that yarn launches java with -Xmx512m. My containers are set up for 1GB of memory with 0.5GB increments. Also, the physical machines hosting the yarn containers have 32GB each. I SSH'ed into one of the physical machines and saw that it had a lot of free memory...

Another strange thing is that Java is not throwing OutOfMemory exceptions. When I look at the driver logs, I see that it eventually gets a SIGTERM from yarn and shuts down nicely. If the Java process inside the yarn container was going over 512MB, shouldn't I have gotten an OutOfMemory exception in Java before it ever tried to allocate 1GB from yarn?

I also tried running with a 1024m heap. That time, the container crashed with a usage of 1.5GB. This happened consistently. So clearly the container had the capacity to allocate another 0.5GB beyond the 1GB limit. (Quite logical, since the physical machine has 30GB of free memory.)

Is there something else inside the YARN container besides Java that could be taking up the extra 512MB?

I'm running CDH 5.4.1 with Apache Spark on YARN. The Java version on the cluster was also upgraded to Oracle Java 8. I saw some people claiming that the default maxPermSize in Java 8 has been changed, but I hardly believe that it could take up 512MB...

Yarn error message:

Diagnostics: Container [pid=23335,containerID=container_1453125563779_0160_02_000001] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1453125563779_0160_02_000001 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 23335 23333 23335 23335 (bash) 1 0 11767808 432 /bin/bash -c LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/hadoop/lib/native::/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/hadoop/lib/native /usr/lib/jvm/java-8-oracle/bin/java -server -Xmx512m -Djava.io.tmpdir=/var/yarn/nm/usercache/hdfs/appcache/application_1453125563779_0160/container_1453125563779_0160_02_000001/tmp '-Dspark.eventLog.enabled=true' '-Dspark.executor.memory=512m' '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar' '-Dspark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/hadoop/lib/native' '-Dspark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/hadoop/lib/native' '-Dspark.shuffle.service.enabled=true' '-Dspark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.1-hadoop2.6.0-cdh5.4.1.jar' '-Dspark.app.name=not_telling-1453479057517' '-Dspark.shuffle.service.port=7337' '-Dspark.driver.extraClassPath=/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar' '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' '-Dspark.yarn.historyServer.address=http://XXXX-cdh-dev-cdh-node2:18088' '-Dspark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/hadoop/lib/native' '-Dspark.eventLog.dir=hdfs://XXXX-cdh-dev-cdh-node1:8020/user/spark/applicationHistory' '-Dspark.master=yarn-cluster' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1453125563779_0160/container_1453125563779_0160_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'not_telling' --jar file:/home/cloud-user/temp/not_telling.jar --arg '--conf' --arg 'spark.executor.extraClasspath=/opt/cloudera/parcels/CDH/jars/htrace-core-3.0.4.jar' --executor-memory 512m --executor-cores 4 --num-executors  10 1> /var/log/hadoop-yarn/container/application_1453125563779_0160/container_1453125563779_0160_02_000001/stdout 2> /var/log/hadoop-yarn/container/application_1453125563779_0160/container_1453125563779_0160_02_000001/stderr 
    |- 23338 23335 23335 23335 (java) 95290 10928 2786668544 261830 /usr/lib/jvm/java-8-oracle/bin/java -server -Xmx512m -Djava.io.tmpdir=/var/yarn/nm/usercache/hdfs/appcache/application_1453125563779_0160/container_1453125563779_0160_02_000001/tmp -Dspark.eventLog.enabled=true -Dspark.executor.memory=512m -Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar -Dspark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/hadoop/lib/native -Dspark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/hadoop/lib/native -Dspark.shuffle.service.enabled=true -Dspark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.1-hadoop2.6.0-cdh5.4.1.jar -Dspark.app.name=not_tellin-1453479057517 -Dspark.shuffle.service.port=7337 -Dspark.driver.extraClassPath=/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.yarn.historyServer.address=http://XXXX-cdh-dev-cdh-node2:18088 -Dspark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/hadoop/lib/native -Dspark.eventLog.dir=hdfs://XXXX-cdh-dev-cdh-node1:8020/user/spark/applicationHistory -Dspark.master=yarn-cluster -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1453125563779_0160/container_1453125563779_0160_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class not_telling --jar file:not_telling.jar --arg --conf --arg spark.executor.extraClasspath=/opt/cloudera/parcels/CDH/jars/htrace-core-3.0.4.jar --executor-memory 512m --executor-cores 4 --num-executors 10 

Your application is being killed for virtual memory usage (notice the "2.6 GB of 2.1 GB virtual memory used" message).

A couple of options that could help:

  1. Disable virtual memory checks in yarn-site.xml by changing "yarn.nodemanager.vmem-check-enabled" to false. This is done pretty frequently; it is usually what I do, to be honest.
  2. Increase "spark.yarn.executor.memoryOverhead" and "spark.yarn.driver.memoryOverhead" until your job stops getting killed. (Both options are sketched right after this list.)
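A minimal sketch of both options; the 512 overhead value is only an illustration and should be tuned for your job:

<!-- yarn-site.xml: option 1, disable the virtual-memory check -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

# option 2, passed at submit time (values are illustrative)
spark-submit \
  --conf spark.yarn.driver.memoryOverhead=512 \
  --conf spark.yarn.executor.memoryOverhead=512 \
  ...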

The reasoning for this is that YARN places a limit on the amount of off-heap memory your process is allowed to use. If your application has a ton of executable code (a large perm gen in Java 7 or earlier), you will hit this limit pretty quickly. You're also pretty likely to hit it if you use PySpark, where off-heap memory is used quite frequently.

Check out this article; it has a great description. You might want to note where it says "Be aware of the max (7%, 384m) overhead off-heap memory when calculating the memory for executors."
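Applied to the question above, that rule roughly explains the 1 GB container (back-of-the-envelope arithmetic, assuming the default overhead and the 0.5 GB allocation increment mentioned in the question):

driver heap (-Xmx)        :  512 MB
default driver overhead   :  max(0.07 * 512, 384) = 384 MB
container request         :  512 + 384 = 896 MB, rounded up to 1 GB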

Edit (by Eshalev): I'm accepting this answer and elaborating on what was found. Java 8 uses a different memory scheme. Specifically, CompressedClasses reserve 1024MB in the "Metaspace". This is much larger than what previous versions of Java would allocate in "perm-gen" memory. You can use "jmap -heap [pid]" to examine this. We currently keep the app from crashing by over-allocating 1024MB beyond our heap requirements. This is wasteful, but it keeps the app from crashing.
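For reference, a sketch of how one could inspect that reservation and, as an untested assumption, cap it via JVM flags (the 256m/128m values are illustrative, not verified against this app):

# inspect heap and metaspace usage of the running driver JVM
jmap -heap <pid>

# possible mitigation: cap metaspace / compressed class space through Spark's extra JVM options
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=128m" \
  --conf "spark.executor.extraJavaOptions=-XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=128m" \
  ...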

Unless you're dealing with very few lines of data, you won't go far with 1GB of memory per executor.

The best way to calculate the correct resources is this: take the number of CPU cores and the amount of memory you have on one node, and leave 1 to 4 CPU cores for the system and HDFS (1 core for a 4-core node, 4 cores if you have a 32-core node). Divide the remaining cores by 2 to 5 (at least 2 so you can multitask with broadcast data, and no more than 5 or you will face bad HDFS I/O bandwidth), and you get the number of executors you can have on one node. Now take the amount of RAM on the node, look up the maximum that YARN allows for all containers on one node (that should be near 26 GB in your case), and divide it by the number of executors calculated before. Remove 10% and you have the amount of memory for one executor.

Manually set spark.yarn.executor.memoryOverhead to 10% of the executor memory, as HDP or CDH might force it to 384MB, which is the minimum value.

Now, for the number of instances, multiply the number of executors per node by the number of nodes and subtract 1 for the driver (and yes, you should raise the amount of memory and CPU for the driver the same way).

So, for example, I have 3 nodes on AWS R4.8xlarge, each with 32 CPUs and 244GB of memory, which allows me to have 20 executors, each with 4 CPUs and 26GB of memory:
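Walking through the formula above with those numbers (rough arithmetic, assuming YARN is allowed roughly 200 GB of the 244 GB on each node):

cores:      32 - 4 reserved = 28 usable, / 4 cores per executor = 7 executors per node
instances:  7 executors x 3 nodes = 21, minus 1 for the driver = 20
memory:     ~200 GB per node / 7 executors ≈ 28.5 GB, minus ~10% ≈ 26 GB per executor
overhead:   ~10% of 26 GB ≈ 2600 MB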

spark.executor.memory=26g
spark.yarn.executor.memoryOverhead=2600
spark.driver.memory=26g
spark.yarn.driver.memoryOverhead=2600
spark.executor.cores=4
spark.executor.instances=20
spark.driver.cores=4

After that, you may have to tune according to your configuration; for example, you may reduce the number of executors to allow them to have more memory.

Sometimes the problem is that your RDD is not partitioned evenly. You can also try increasing the partitioning (by doing coalesce or repartition; you can also use partitionBy) on each or some of the transformations you make.
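A minimal sketch of that idea (the input path and partition counts are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("repartition-example"))

// hypothetical input; skewed or too-few partitions can overload a single executor
val lines = sc.textFile("hdfs:///data/input")

// spread the data across more, smaller partitions before a heavy transformation
val repartitioned = lines.repartition(200)

// partitionBy is the key-based variant for pair RDDs, e.g.:
// val byKey = pairs.partitionBy(new org.apache.spark.HashPartitioner(200))

println(repartitioned.map(_.length).reduce(_ + _))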
