
Difference between "spark.yarn.executor.memoryOverhead" and "spark.memory.offHeap.size"

I am running Spark on YARN. I don't understand the difference between the following settings: spark.yarn.executor.memoryOverhead and spark.memory.offHeap.size. Both seem to be settings for allocating off-heap memory to the Spark executor. Which one should I use? Also, what is the recommended setting for executor off-heap memory?

Many thanks!

spark.executor.memoryOverhead is used by resource managers such as YARN, whereas spark.memory.offHeap.size is used by Spark core (the memory manager). The relationship between them differs depending on the version.

Spark 2.4.5 and before:

spark.executor.memoryOverhead should include spark.memory.offHeap.size. This means that if you specify offHeap.size, you need to manually add this portion to memoryOverhead for YARN. As you can see in the code below from YarnAllocator.scala, when Spark requests resources from YARN, it does not know anything about offHeap.size:

// Spark 2.4.x YarnAllocator.scala: the container request covers heap,
// overhead, and PySpark memory only; off-heap size is not included.
private[yarn] val resource = Resource.newInstance(
    executorMemory + memoryOverhead + pysparkWorkerMemory,
    executorCores)
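
In practice, that means sizing the overhead by hand on Spark 2.4.x. A minimal sketch (the application name and all sizes below are illustrative assumptions, not recommendations):

import org.apache.spark.sql.SparkSession

// Spark 2.4.x: YARN only sees executorMemory + memoryOverhead, so the
// 2g of off-heap memory must be folded into memoryOverhead by hand.
val spark = SparkSession.builder()
  .appName("offheap-sizing-spark24")              // hypothetical app name
  .config("spark.executor.memory", "8g")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  // the default overhead would be roughly max(0.1 * executor memory, 384m);
  // add the 2g off-heap region on top of it manually:
  .config("spark.executor.memoryOverhead", "3g")
  .getOrCreate()

The same keys can equally be passed as --conf flags to spark-submit; either way they need to be set before the executors are requested.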

However, this behavior changed in Spark 3.0:

spark.executor.memoryOverhead does not include spark.memory.offHeap.size anymore. YARN will include offHeap.size for you when requesting resources. From the new documentation:

Note: Additional memory includes PySpark executor memory (when spark.executor.pyspark.memory is not configured) and memory used by other non-executor processes running in the same container. The maximum memory size of container to running executor is determined by the sum of spark.executor.memoryOverhead, spark.executor.memory, spark.memory.offHeap.size and spark.executor.pyspark.memory.

And you can also tell from the code:

// Spark 3.0 YarnAllocator.scala: executorOffHeapMemory is now added
// to the container request automatically.
private[yarn] val resource: Resource = {
    val resource = Resource.newInstance(
      executorMemory + executorOffHeapMemory + memoryOverhead + pysparkWorkerMemory, executorCores)
    ResourceRequestHelper.setResourceRequests(executorResourceRequests, resource)
    logDebug(s"Created resource capability: $resource")
    resource
  }
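
As a concrete illustration (the sizes are hypothetical): with spark.executor.memory=8g, spark.memory.offHeap.size=2g, spark.executor.memoryOverhead=1g, and spark.executor.pyspark.memory unset, Spark 3.0 asks YARN for an 11g container per executor (8g + 2g + 1g). On Spark 2.4.5 and before, the same settings would request only 9g (8g + 1g), which is why you had to fold the 2g into memoryOverhead yourself.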

For more details of this change you can refer to this Pull Request.

For your second question, what is the recommended setting for executor off-heap memory? It depends on your application, and you need to do some testing. I found this page helpful in explaining it further:

Off-heap memory is a great way to reduce GC pauses because it's not in the GC's scope. However, it brings an overhead of serialization and deserialization. The latter in turn means that the off-heap data can sometimes be put onto heap memory and hence be exposed to GC. Also, the new data format brought by Project Tungsten (array of bytes) helps to reduce the GC overhead. These 2 reasons make the use of off-heap memory in Apache Spark applications something that should be carefully planned and, especially, tested.
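
As a starting point for that kind of testing, a minimal sketch (the application name and the 1g below are assumptions to benchmark against your workload, not recommendations):

import org.apache.spark.sql.SparkSession

// Experiment: enable off-heap memory and compare against a baseline run.
val spark = SparkSession.builder()
  .appName("offheap-experiment")                  // hypothetical app name
  .config("spark.memory.offHeap.enabled", "true") // disabled by default
  .config("spark.memory.offHeap.size", "1g")      // must be > 0 when enabled
  .getOrCreate()

// Compare job duration and "GC Time" in the Spark UI's Executors tab
// against an otherwise identical run with spark.memory.offHeap.enabled=false.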

BTW, spark.yarn.executor.memoryOverhead is deprecated and has been renamed to spark.executor.memoryOverhead, which is shared by YARN and Kubernetes.

  1. spark.yarn.executor.memoryOverhead is used in the StaticMemoryManager. This is used in older Spark versions like 1.2.

The amount of off heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).

You can find this in older Spark docs, like the Spark 1.2 docs:

https://spark.apache.org/docs/1.2.0/running-on-yarn.html

  2. spark.memory.offHeap.size is used in the UnifiedMemoryManager, which is used by default after version 1.6 (see the sizing example after this list).

The absolute amount of memory in bytes which can be used for off-heap allocation. This setting has no impact on heap memory usage, so if your executors' total memory consumption must fit within some hard limit then be sure to shrink your JVM heap size accordingly. This must be set to a positive value when spark.memory.offHeap.enabled=true.

You can find this in the latest Spark docs, like the Spark 2.4 docs:

https://spark.apache.org/docs/2.4.4/configuration.html
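
To make the "shrink your JVM heap size accordingly" advice concrete, here is a hypothetical sizing example (all numbers are assumptions): suppose each executor container is capped at 10g and you currently run with spark.executor.memory=8g plus roughly 1g of overhead. If you then enable spark.memory.offHeap.size=2g, the executor's total footprint must still fit under 10g, so you would shrink the heap to about 6g, giving 6g heap + 2g off-heap + ~1g overhead within the cap. On Spark 2.4.5 and before, remember that the 2g also has to be folded into spark.executor.memoryOverhead for YARN's accounting, as explained above.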
