
OutOfMemoryError: Java heap space and memory variables in Spark

I have been trying to execute a Scala program, and the output somehow always looks something like this:

15/08/17 14:13:14 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:64)
at java.lang.StringBuilder.<init>(StringBuilder.java:97)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:339)
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:83)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2344)
at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:32)
at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:44)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:143)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:143)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:143)
at org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:169)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:34)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1215)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)

or like this:

15/08/19 11:45:11 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider$Impl.createInstance(DefaultSerializerProvider.java:526)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider$Impl.createInstance(DefaultSerializerProvider.java:505)
    at com.fasterxml.jackson.databind.ObjectMapper._serializerProvider(ObjectMapper.java:2846)
    at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
    at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
    at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
    at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
    at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:17)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
    at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
    at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
    at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
    at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
    at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:17)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
    at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
    at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
    at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
    at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
    at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)

Are these errors on the driver or executor side?

I am a bit confused by the memory variables that Spark uses. My current settings are:

spark-env.sh

export SPARK_WORKER_MEMORY=6G
export SPARK_DRIVER_MEMORY=6G
export SPARK_EXECUTOR_MEMORY=4G

spark-defaults.conf

# spark.driver.memory              6G
# spark.executor.memory            4G
# spark.executor.extraJavaOptions  ' -Xms5G -Xmx5G '
# spark.driver.extraJavaOptions   ' -Xms5G -Xmx5G '

Do I need to uncomment any of the variables contained in spark-defaults.conf, or are they redundant?

Is, for example, setting SPARK_WORKER_MEMORY equivalent to setting spark.executor.memory?
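For context, a minimal sketch of how the executor setting can also be supplied programmatically when the context is created (the app name below is just a placeholder, not from my actual code; spark.driver.memory normally has to be set before the driver JVM starts, e.g. through spark-submit or spark-defaults.conf, so it is not shown here):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: mirrors the value from spark-defaults.conf above.
val conf = new SparkConf()
    .setAppName("memory-example")          // placeholder name
    .set("spark.executor.memory", "4g")    // heap size of each executor JVM
val sc = new SparkContext(conf)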

Part of my Scala code, where it stops after a few iterations:

val filteredNodesGroups = connCompGraph.vertices.map { case (_, array) => array(pagerankIndex) }.distinct.collect
for (id <- filteredNodesGroups) {
    val clusterGraph = connCompGraph.subgraph(vpred = (_, attr) => attr(pagerankIndex) == id)
    val pagerankGraph = clusterGraph.pageRank(0.15)
    val completeClusterPagerankGraph = clusterGraph.outerJoinVertices(pagerankGraph.vertices) {
        case (uid, attrList, Some(pr)) =>
            attrList :+ ("inClusterPagerank:" + pr)
        case (uid, attrList, None) =>
            attrList :+ ""
    }
    val sortedClusterNodes = completeClusterPagerankGraph.vertices.toArray.sortBy(_._2(pagerankIndex + 1))
    println(sortedClusterNodes(0)._2(1) + " with rank: " + sortedClusterNodes(0)._2(pagerankIndex + 1))
}

Many questions disguised as one. Thank you in advance!

I'm not a Spark expert, but there is a line that seems suspicious to me:

val filteredNodesGroups = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }.distinct.collect

Basically, by using the collect method, you are bringing all the data from your executors back to the driver (before even processing it). Do you have any idea of the size of this data?

In order to fix this, you should proceed in a more functional way. To extract the distinct values, you could for example use groupBy and map:

val pairs = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }
pairs.groupBy(_./* the property to group on */)
     .map { case (_, arrays) => /* map function */ }

Regarding the collect, there should be a way to sort each partition and then return the (processed) result to the driver. I would like to help you more, but I need more information about what you are trying to do.

UPDATE

After digging a little, it turns out you can sort your data with a shuffle, as described here.
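For instance, something along these lines (just a sketch reusing the names from the question; sortBy shuffles the data while keeping it distributed, and take(1) only brings a single element back to the driver):

// Sketch: sort by the appended PageRank attribute with a shuffle instead of
// collecting the whole vertex RDD to the driver.
val sortedClusterNodes = completeClusterPagerankGraph.vertices
    .sortBy { case (_, attrs) => attrs(pagerankIndex + 1) }

// Only the first element is fetched to the driver.
sortedClusterNodes.take(1).foreach { case (_, attrs) =>
    println(attrs(1) + " with rank: " + attrs(pagerankIndex + 1))
}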

UPDATE

So far, I've tried to avoid the collect and to avoid bringing data back to the driver as much as possible, but I have no idea how to solve this:

val filteredNodesGroups = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }.distinct()
val clusterGraphs = filteredNodesGroups.map { id => connCompGraph.subgraph(vpred = (_, attr) => attr(pagerankIndex) == id) }
val pageRankGraphs = clusterGraphs.map(_.pageRank(0.15))

Basically, you would need to join two RDD[Graph[Array[String], String]]s, but I don't know what key to use, and secondly this would necessarily return an RDD of RDDs (I don't even know whether that is possible). I'll try to find something later today.
