OutOfMemoryError: Java heap space and memory variables in Spark
I have been trying to run a Scala program, and the output always ends up looking something like this:
15/08/17 14:13:14 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:64)
at java.lang.StringBuilder.<init>(StringBuilder.java:97)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:339)
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:83)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2344)
at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:32)
at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:44)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:143)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:143)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:143)
at org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:169)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:34)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1215)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
or like this:
15/08/19 11:45:11 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider$Impl.createInstance(DefaultSerializerProvider.java:526)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider$Impl.createInstance(DefaultSerializerProvider.java:505)
at com.fasterxml.jackson.databind.ObjectMapper._serializerProvider(ObjectMapper.java:2846)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:17)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:17)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:280)
at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1255)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:22)
at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
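Are these errors on the driver side or on the executor side?
I am a bit confused by the memory variables that Spark uses. My current settings are:
spark-env.sh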
export SPARK_WORKER_MEMORY=6G
export SPARK_DRIVER_MEMORY=6G
export SPARK_EXECUTOR_MEMORY=4G
spark-defaults.conf
# spark.driver.memory 6G
# spark.executor.memory 4G
# spark.executor.extraJavaOptions ' -Xms5G -Xmx5G '
# spark.driver.extraJavaOptions ' -Xms5G -Xmx5G '
Do I need to uncomment any of the variables in spark-defaults.conf, or are they redundant?
For example, is setting SPARK_WORKER_MEMORY equivalent to setting spark.executor.memory?
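One way to see which setting actually took effect is to ask the running JVM rather than infer it from the configuration files. The following is a minimal sketch in plain Scala; `HeapCheck` and `maxHeapMiB` are illustrative names, not part of Spark, and the commented Spark calls assume a live SparkContext named `sc`.

```scala
object HeapCheck {
  // Maximum heap the current JVM will attempt to use, in MiB.
  // This is roughly the effective -Xmx value of the process it runs in.
  def maxHeapMiB: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  def main(args: Array[String]): Unit = {
    println(s"max heap of this JVM: $maxHeapMiB MiB")
    // Inside a Spark application (assumption: `sc` is the SparkContext),
    // the configured values can be inspected on the driver with:
    //   sc.getConf.getOption("spark.driver.memory")
    //   sc.getConf.getOption("spark.executor.memory")
    // Both return None when the key was never set explicitly.
  }
}
```

Run on the driver, this reports the driver heap; the same `Runtime` call evaluated inside a task would report the heap of the executor running that task.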
This is the part of my Scala code that stops after a few iterations:
val filteredNodesGroups = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }.distinct.collect
for (id <- filteredNodesGroups){
  val clusterGraph = connCompGraph.subgraph(vpred = (_, attr) => attr(pagerankIndex) == id)
  val pagerankGraph = clusterGraph.pageRank(0.15)
  val completeClusterPagerankGraph = clusterGraph.outerJoinVertices(pagerankGraph.vertices) {
    case (uid, attrList, Some(pr)) =>
      attrList :+ ("inClusterPagerank:" + pr)
    case (uid, attrList, None) =>
      attrList :+ ""
  }
  val sortedClusterNodes = completeClusterPagerankGraph.vertices.toArray.sortBy(_._2(pagerankIndex + 1))
  println(sortedClusterNodes(0)._2(1) + " with rank: " + sortedClusterNodes(0)._2(pagerankIndex + 1))
}
That is many questions disguised as one. Thank you in advance!
I am not a Spark expert, but one line looks suspicious to me:
val filteredNodesGroups = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }.distinct.collect
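Basically, by calling collect you bring all of that data back from the executors to the driver, before it has even been processed. Do you have any idea how large that data is?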
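The size question can be answered without pulling the data over: `count()` sends only a number to the driver. Below is a minimal sketch of a guard built on that idea. `CollectGuard`, `collectIfSmall` and `limit` are hypothetical names, and the element type `String` is an assumption about what `array(pagerankIndex)` holds.

```scala
import org.apache.spark.rdd.RDD

object CollectGuard {
  // Collects the distinct values only if there are at most `limit` of them.
  def collectIfSmall(ids: RDD[String], limit: Long): Option[Array[String]] = {
    val distinctIds = ids.distinct().cache() // reused by count() and collect()
    val n = distinctIds.count()              // only a single Long reaches the driver
    val result = if (n <= limit) Some(distinctIds.collect()) else None
    distinctIds.unpersist()
    result
  }
}
```

With the code from the question this could be called as `CollectGuard.collectIfSmall(connCompGraph.vertices.map { case (_, array) => array(pagerankIndex) }, 10000L)`, where the limit is an arbitrary illustrative value.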
To fix this, you should proceed in a more functional way. To extract the distinct values you could, for example, use groupBy and map:
val pairs = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }
pairs.groupBy(_./* the property to group on */)
  .map { case (_, arrays) => /* map function */ }
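As for the collect, there should be a way to sort each partition and then return only the (processed) result to the driver. I would like to help more, but I need more information about what you are trying to do.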
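Since the loop in the question only uses the first element of the sorted array, one possible way to do this is `takeOrdered`, which keeps the smallest elements of each partition and merges only those on the driver. This is a sketch under assumptions: `SmallestVertex` and `smallestBy` are illustrative names, and the vertex attribute type `Array[String]` is inferred from the question's code.

```scala
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

object SmallestVertex {
  // Returns the vertex whose attribute at position `idx` is smallest
  // (string comparison, as in the original sortBy), or None for an empty RDD.
  def smallestBy(vertices: RDD[(VertexId, Array[String])],
                 idx: Int): Option[(VertexId, Array[String])] =
    vertices
      .takeOrdered(1)(Ordering.by[(VertexId, Array[String]), String](_._2(idx)))
      .headOption
}
```

In the question's loop this would replace the `toArray.sortBy(...)` line with something like `SmallestVertex.smallestBy(completeClusterPagerankGraph.vertices, pagerankIndex + 1)`, so only one vertex per cluster reaches the driver.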
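UPDATE
After digging a little: you can sort your data using a shuffle, as described here
UPDATE
So far I have tried to avoid the collect and to avoid bringing data back to the driver as much as possible, but I don't know how to solve this: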
val filteredNodesGroups = connCompGraph.vertices.map{ case(_, array) => array(pagerankIndex) }.distinct()
val clusterGraphs = filteredNodesGroups.map { id => connCompGraph.subgraph(vpred = (_, attr) => attr(pagerankIndex) == id) }
val pageRankGraphs = clusterGraphs.map(_.pageRank(0.15))
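Basically, you need to join two RDD[Graph[Array[String], String]], but I don't know which key to use, and secondly this would necessarily return an RDD of RDDs (I don't know whether that is even possible). I will try to find something later today.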
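On the last point: Spark does not support invoking RDD or Graph operations from inside another RDD's transformation, so calling `subgraph` within `filteredNodesGroups.map` will fail at runtime. A possible alternative is to keep the loop on the driver over the (small) list of group ids, while making sure no vertex data is collected and each intermediate graph is released. The sketch below assumes a `Graph[Array[String], String]`; `PerGroupPageRank`, `topPerGroup`, `groupIdx` and `tol` are hypothetical names. It returns the highest-ranked vertex per group, which differs from the ascending string sort in the question, and it is not established that it removes the OutOfMemoryError shown above.

```scala
import org.apache.spark.graphx.{Graph, VertexId}

object PerGroupPageRank {
  // For every group id, runs PageRank on that group's subgraph and keeps
  // only the best-ranked vertex, so one small tuple per group reaches the driver.
  def topPerGroup(graph: Graph[Array[String], String],
                  groupIdx: Int,
                  groupIds: Seq[String],
                  tol: Double): Map[String, (VertexId, Double)] =
    groupIds.flatMap { id =>
      val cluster = graph.subgraph(vpred = (_, attr) => attr(groupIdx) == id)
      val ranked = cluster.pageRank(tol)
      // top(1) selects per partition first, then merges on the driver.
      val best = ranked.vertices
        .top(1)(Ordering.by[(VertexId, Double), Double](_._2))
        .headOption
      ranked.unpersist(blocking = false) // release cached data before the next group
      best.map(id -> _)
    }.toMap
}
```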