
java.lang.OutOfMemoryError in rdd.collect() when all memory settings are set to huge

I run the following Python script with spark-submit:

r = rdd.map(list).groupBy(lambda x: x[0]).map(lambda x: x[1]).map(list)
r_labeled = r.map(f_0).flatMap(f_1)
r_labeled.map(lambda x: x[3]).collect()

It fails with java.lang.OutOfMemoryError, specifically on the collect() action in the last line:

java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
17/11/08 08:27:31 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 6,5,main]
java.lang.OutOfMemoryError
    ... (same stack trace as above)
17/11/08 08:27:31 INFO SparkContext: Invoking stop() from shutdown hook
17/11/08 08:27:31 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 6, localhost, executor driver): java.lang.OutOfMemoryError
    ... (same stack trace as above)

The message says OutOfMemoryError but nothing else. Is it about the heap, garbage collection, or something else? I don't know.

Anyway, I tried configuring every memory-related setting to a huge value:

spark.driver.maxResultSize = 0 # no limit
spark.driver.memory = 150g
spark.executor.memory = 150g
spark.worker.memory = 150g

(And the server has 157g of physical memory available.)
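
For reference, these settings can be supplied either as spark-submit flags or through SparkConf. A minimal sketch, assuming the script creates its own SparkContext:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.driver.maxResultSize", "0")   # 0 = no limit on collected results
        .set("spark.executor.memory", "150g"))
# spark.driver.memory usually has to be set before the driver JVM starts,
# e.g. spark-submit --driver-memory 150g, rather than from inside the script.
sc = SparkContext(conf=conf)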

The same error is still there.

Then I reduced the input data a little, and the code ran perfectly every time. In fact, the data returned by collect() is about 1.8g, far smaller than the 157g of physical memory.

Now I am sure the error is not about the code, and physical memory is not the limit. It is as if there is a threshold on the size of the input data, and exceeding it causes the out-of-memory error.

So how can I raise this threshold so that I can handle bigger input data without memory errors? Any settings?

Thanks.

========== follow up ==========

According to this, the error is related to the Java serializer and big objects in a map transformation. I did use big objects in my code. I am wondering how to get the Java serializer to accommodate big objects.

First of all, it makes sense that you only get a problem when you call the collect method. Spark is lazy, so it does absolutely nothing until you send data to the driver (collect, reduce, count, ...) or to disk (write, save, ...).
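
A tiny illustration of that laziness (hypothetical data, not the question's RDD):

nums = sc.parallelize(range(10))
doubled = nums.map(lambda x: x * 2)   # nothing runs yet: only the lineage is recorded
print(doubled.count())                # the action triggers the actual computation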

Then it seems that you get an out-of-memory exception on an executor. What I understand from the stack trace is that your groupBy is creating an array whose size exceeds the maximum array capacity (Integer.MAX_VALUE - 5, i.e. roughly 2 GB, according to this). Could a given key appear more than 2 billion times in your dataset?
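
One way to check that, as a sketch (assuming the key is the first field, as in the question's groupBy), is to count records per key with reduceByKey, which never materializes a whole group in memory:

from operator import add

key_counts = rdd.map(list).map(lambda x: (x[0], 1)).reduceByKey(add)
print(key_counts.max(key=lambda kv: kv[1]))   # (most frequent key, its count)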

In any case, I am not sure I understand exactly what you are trying to do, but if you can, try to replace the groupBy with a reduce operation, which puts much less stress on memory.
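
For example, if the downstream logic only needs a per-key summary rather than the full list of rows, something like aggregateByKey keeps each group reduced to a small value at all times. This is only a sketch: the count-and-sum of a hypothetical numeric field x[1] stands in for whatever f_0/f_1 actually compute.

pairs = rdd.map(list).map(lambda x: (x[0], float(x[1])))   # (key, one numeric field)
summaries = pairs.aggregateByKey(
    (0, 0.0),                                   # zero value: (count, sum)
    lambda acc, v: (acc[0] + 1, acc[1] + v),    # fold one value within a partition
    lambda a, b: (a[0] + b[0], a[1] + b[1]))    # merge partition-level summaries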

Finally, you gave 150g to the driver and 150g to each executor, although you only have about 150g for everything. I do not know who got what in your case. Try to share the memory reasonably and tell us what happens.
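
For instance (numbers purely illustrative for a single machine with 157g running the driver and the executor together), a split that leaves headroom for the OS might look like:

spark.driver.memory = 40g
spark.executor.memory = 100g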

Hope this helps.
