OutOfMemoryError：Spark中的Java堆空间和内存变量

Question

I have been trying to execute a scala program and the output somehow always seems to be something like this: 我一直在尝试执行一个scala程序，并且输出总是以某种方式看起来像这样：

15/08/17 14:13:14 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:64)
at java.lang.StringBuilder.<init>(StringBuilder.java:97)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:339)
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:83)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2344)
at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:32)
at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:44)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:143)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:143)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:143)
at org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:169)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:34)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1215)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)

or like this 或像这样

15/08/19 11:45:11 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider$Impl.createInstance(DefaultSerializerProvider.java:526)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider$Impl.createInstance(DefaultSerializerProvider.java:505)
    at com.fasterxml.jackson.databind.ObjectMapper._serializerProvider(ObjectMapper.java:2846)
    at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
    at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
    at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
    at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
    at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:17)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
    at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
    at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
    at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
    at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
    at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:17)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
    at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
    at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
    at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
    at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)

Are these errors on the driver or executor side? 这些错误是在驱动程序还是执行程序方面？

I am a bit confused with the memory variables that Spark uses. 我对Spark使用的内存变量有些困惑。 My current settings are 我当前的设置是

spark-env.sh 星火环境

export SPARK_WORKER_MEMORY=6G
export SPARK_DRIVER_MEMORY=6G
export SPARK_EXECUTOR_MEMORY=4G

spark-defaults.conf spark-defaults.conf

# spark.driver.memory              6G
# spark.executor.memory            4G
# spark.executor.extraJavaOptions  ' -Xms5G -Xmx5G '
# spark.driver.extraJavaOptions   ' -Xms5G -Xmx5G '

Do I need to uncomment any of the variables contained in spark-defaults.conf, or are they redundant? 我是否需要取消注释spark-defaults.conf中包含的任何变量，或者它们是否多余？

Is for example setting SPARK_WORKER_MEMORY equivalent to setting the spark.executor.memory ? 例如，设置SPARK_WORKER_MEMORY等效于设置spark.executor.memory ？

Part of my scala code where it stops after a few iterations: 我的scala代码的一部分在几次迭代后停止了：

   val filteredNodesGroups = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }.distinct.collect
    for (id <- filteredNodesGroups){
        val clusterGraph = connCompGraph.subgraph(vpred = (_, attr) => attr(pagerankIndex) == id)
        val pagerankGraph = clusterGraph.pageRank(0.15)
        val completeClusterPagerankGraph = clusterGraph.outerJoinVertices(pagerankGraph.vertices) {
            case (uid, attrList, Some(pr)) => 
                attrList :+ ("inClusterPagerank:" + pr)
            case (uid, attrList, None) => 
                attrList :+ ""
        }
        val sortedClusterNodes = completeClusterPagerankGraph.vertices.toArray.sortBy(_._2(pagerankIndex + 1))
       println(sortedClusterNodes(0)._2(1) + " with rank: " + sortedClusterNodes(0)._2(pagerankIndex + 1))

     }

Many questions disguised as one. 许多问题伪装成一个问题。 Thank you in advance! 先感谢您！

Answer 1

I'm not a Spark expert, but there is line that seems suspicious to me : 我不是Spark专家，但是有一行对我来说似乎可疑：

val filteredNodesGroups = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }.distinct.collect

Basically, by using the collect method, you are getting back all the data from your executors (before even processing it) to the driver. 基本上，通过使用collect方法，您会将所有数据从执行程序（甚至在处理之前）取回给驱动程序。 Do you have any idea about the size of this data ? 您对这些数据的大小有任何想法吗？

In order to fix this, you should proceed in a more functional way. 为了解决此问题，您应该以更具功能性的方式进行。 To extract the distinct values, you could for example use a groupBy and map : 要提取不同的值，您可以例如使用groupBy和map：

val pairs = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }
pairs.groupBy(_./* the property to group on */)
     .map { case (_, arrays) => /* map function */ }

Regarding the collect, there should be a way to sort each partition and then to return the (processed) result to the driver. 关于收集，应该有一种方法可以对每个分区进行排序，然后将（处理后的）结果返回给驱动程序。 I would like to help you more but I need more information about what you are trying to do. 我想为您提供更多帮助，但我需要更多有关您要做什么的信息。

UPDATE 更新

After digging a little bit, you could sort your data using shuffling as described here 挖一点点后，你可以在你的数据使用洗牌所描述的排序这里

UPDATE 更新

So far, I've tried to avoid the collect, and to get the data back to the driver as much as possible, but I've no idea how to solve this : 到目前为止，我已尝试避免收集，并尽可能将数据返回驱动程序，但我不知道如何解决此问题：

val filteredNodesGroups = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }.distinct()
val clusterGraphs = filteredNodesGroups.map { id => connCompGraph.subgraph(vpred = (_, attr) => attr(pagerankIndex) == id) }
val pageRankGraphs = clusterGraphs.map(_.pageRank(0.15))

Basically, you need to join two RDD[Graph[Array[String], String]], but I don't know what key to use, and secondly this would necessarily return an RDD of RDD (I don't know if you can even do that). 基本上，您需要连接两个RDD [Graph [Array [Array [String]，String]]，但我不知道要使用什么键，其次，这必然会返回RDD的RDD（我不知道是否可以甚至这样做）。 I'll try to find something later this day. 我将在今天晚些时候尝试找到一些东西。

OutOfMemoryError：Spark中的Java堆空间和内存变量

问题描述

1 个解决方案

解决方案1
0 2015-08-17 14:31:41

OutOfMemoryError：Spark中的Java堆空间和内存变量

问题描述

1 个解决方案

解决方案1 0 2015-08-17 14:31:41

解决方案1
0 2015-08-17 14:31:41