
Python Spark / Yarn memory usage

I have a Spark Python application that is being killed by YARN for exceeding memory limits. I have a step that involves loading some resources that are a bit heavy (500+ MB), so I'm using mapPartitions. Basically:

def process_and_output(partition):
    # heavy resources are loaded once per partition, not once per record
    resources = load_resources()
    for record in partition:
        yield transform_record(resources, record)

input = sc.textFile(input_location)
processed = input.mapPartitions(process_and_output)
processed.saveAsTextFile(output_location)

When running, I consistently get this error:

ERROR YarnScheduler: Lost executor 1 on (address removed): Container killed by YARN for exceeding memory limits. 11.4 GB of 11.2 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

I tried boosting memoryOverhead extremely high, but still had the same issue. I ran with:

--conf "spark.python.worker.memory=1200m" \
--conf "spark.yarn.executor.memoryOverhead=5300" \
--conf "spark.executor.memory=6g" \

Surely, that's enough memoryOverhead?

I guess more generally, I'm struggling to understand how the Python worker's memory is controlled/counted in the overall total. Is there any documentation of this?

I'd also like to understand whether using a generator function will actually cut down on memory usage. Will it stream data through the Python process (like I am hoping) or will it buffer it all before sending back to the JVM/Spark infrastructure?
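For concreteness, here is a sketch of the two styles I mean; load_resources and transform_record are the same placeholders as in the snippet above:

def process_streaming(partition):
    resources = load_resources()
    for record in partition:
        # yields one transformed record at a time
        yield transform_record(resources, record)

def process_buffered(partition):
    resources = load_resources()
    # builds the full list of transformed records up front
    return [transform_record(resources, record) for record in partition]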

YARN kills an executor when its

memory usage > (executor-memory + executor.memoryOverhead)

From your settings, it looks like it is a valid exception.

(memory usage) 11.4GB > 11.18GB (executor-memory=6GB + memoryOverhead=5.18GB)

Try with:

--conf "spark.yarn.executor.memoryOverhead=6144"`

As you can see, 11.2 GB is the maximum memory for a container created by YARN. It is equal to executor memory + overhead. So Python memory is not counted in that.

The exception suggests increasing the overhead, but you can also just increase executor-memory without increasing the overhead memory. That's all I can say without knowing why you need that much memory in a single executor; maybe a cartesian join or something like that requires so much memory.
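For example, a sketch of that approach (the values are illustrative; what actually fits depends on your cluster's container limits):

--conf "spark.executor.memory=8g" \
--conf "spark.yarn.executor.memoryOverhead=5300" \

This raises the container ceiling to roughly 13.2 GB while leaving the overhead setting alone.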

Two and a half years later... I happen to be reading the Spark release notes and see this:

Add spark.executor.pyspark.memory limit

With this linked bug: https://issues.apache.org/jira/browse/SPARK-25004

I've long since worked around my original issue and then changed jobs so I no longer have the ability to try this out. But I suspect this may have been the exact problem I was having.
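If anyone hits the same thing on a newer Spark version, setting it looks roughly like this (a sketch; the 2g value is just a placeholder):

--conf "spark.executor.pyspark.memory=2g" \

As I read SPARK-25004, this puts an explicit cap on the Python worker memory per executor instead of leaving it to compete with everything else in the YARN overhead.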
