![](/img/trans.png)
[英]java.lang.OutOfMemoryError: Java heap space in spark application
[英]OutOfMemoryError: Java heap space and memory variables in Spark
我一直在尝试执行一个scala程序,并且输出总是以某种方式看起来像这样:
15/08/17 14:13:14 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:64)
at java.lang.StringBuilder.<init>(StringBuilder.java:97)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:339)
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:83)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2344)
at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:32)
at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:44)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:143)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:143)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:143)
at org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:169)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:34)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1215)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
或像这样
15/08/19 11:45:11 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider$Impl.createInstance(DefaultSerializerProvider.java:526)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider$Impl.createInstance(DefaultSerializerProvider.java:505)
at com.fasterxml.jackson.databind.ObjectMapper._serializerProvider(ObjectMapper.java:2846)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:17)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:17)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
这些错误是在驱动程序还是执行程序方面?
我对Spark使用的内存变量有些困惑。 我当前的设置是
星火环境
export SPARK_WORKER_MEMORY=6G
export SPARK_DRIVER_MEMORY=6G
export SPARK_EXECUTOR_MEMORY=4G
spark-defaults.conf
# spark.driver.memory 6G
# spark.executor.memory 4G
# spark.executor.extraJavaOptions ' -Xms5G -Xmx5G '
# spark.driver.extraJavaOptions ' -Xms5G -Xmx5G '
我是否需要取消注释spark-defaults.conf中包含的任何变量,或者它们是否多余?
例如,设置SPARK_WORKER_MEMORY
等效于设置spark.executor.memory
?
我的scala代码的一部分在几次迭代后停止了:
val filteredNodesGroups = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }.distinct.collect
for (id <- filteredNodesGroups){
val clusterGraph = connCompGraph.subgraph(vpred = (_, attr) => attr(pagerankIndex) == id)
val pagerankGraph = clusterGraph.pageRank(0.15)
val completeClusterPagerankGraph = clusterGraph.outerJoinVertices(pagerankGraph.vertices) {
case (uid, attrList, Some(pr)) =>
attrList :+ ("inClusterPagerank:" + pr)
case (uid, attrList, None) =>
attrList :+ ""
}
val sortedClusterNodes = completeClusterPagerankGraph.vertices.toArray.sortBy(_._2(pagerankIndex + 1))
println(sortedClusterNodes(0)._2(1) + " with rank: " + sortedClusterNodes(0)._2(pagerankIndex + 1))
}
许多问题伪装成一个问题。 先感谢您!
我不是Spark专家,但是有一行对我来说似乎可疑:
val filteredNodesGroups = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }.distinct.collect
基本上,通过使用collect方法,您会将所有数据从执行程序(甚至在处理之前)取回给驱动程序。 您对这些数据的大小有任何想法吗?
为了解决此问题,您应该以更具功能性的方式进行。 要提取不同的值,您可以例如使用groupBy和map:
val pairs = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }
pairs.groupBy(_./* the property to group on */)
.map { case (_, arrays) => /* map function */ }
关于收集,应该有一种方法可以对每个分区进行排序,然后将(处理后的)结果返回给驱动程序。 我想为您提供更多帮助,但我需要更多有关您要做什么的信息。
更新
挖一点点后,你可以在你的数据使用洗牌所描述的排序这里
更新
到目前为止,我已尝试避免收集,并尽可能将数据返回驱动程序,但我不知道如何解决此问题:
val filteredNodesGroups = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }.distinct()
val clusterGraphs = filteredNodesGroups.map { id => connCompGraph.subgraph(vpred = (_, attr) => attr(pagerankIndex) == id) }
val pageRankGraphs = clusterGraphs.map(_.pageRank(0.15))
基本上,您需要连接两个RDD [Graph [Array [Array [String],String]],但我不知道要使用什么键,其次,这必然会返回RDD的RDD(我不知道是否可以甚至这样做)。 我将在今天晚些时候尝试找到一些东西。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.