
'Exception in thread "dispatcher-event-loop-0" java.lang.OutOfMemoryError: Java heap space' error in Spark Scala code

    import org.apache.spark.sql.RowFactory
    import org.apache.spark.sql.functions.monotonically_increasing_id
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import spark.implicits._

    val data = spark.read
        .text(filePath)
        .toDF("val")
        .withColumn("id", monotonically_increasing_id())

    val count = data.count()

    // header row (note: monotonically_increasing_id is increasing but
    // not guaranteed to be consecutive across partitions)
    val header = data.where("id == 1").collect().map(s => s.getString(0)).apply(0)

    // column names: strip the "H|*|" prefix and the "|##|" terminator,
    // then split on the |*| field delimiter
    val columns = header
        .replace("H|*|", "")
        .replace("|##|", "")
        .split("\\|\\*\\|")

    val structSchema = StructType(columns.map(s => StructField(s, StringType, true)))

    // rows between the header and the trailer
    val correctData = data.where('id > 1 && 'id < count - 1).select("val")
    val dataString = correctData.collect().map(s => s.getString(0)).mkString("").replace("\\\n", "").replace("\\\r", "")
    val dataArr = dataString.split("\\|\\#\\#\\|").map { s =>
        var arr = s.split("\\|\\*\\|")
        while (arr.length < columns.length) arr = arr :+ ""
        RowFactory.create(arr: _*)
    }
    val finalDF = spark.createDataFrame(sc.parallelize(dataArr), structSchema)

    display(finalDF)

This portion of the code gives the error:

Exception in thread "dispatcher-event-loop-0" java.lang.OutOfMemoryError: Java heap space

After hours of debugging, I found that mainly this part:

    val dataArr = dataString.split("\\|\\#\\#\\|").map { s =>
        var arr = s.split("\\|\\*\\|")
        while (arr.length < columns.length) arr = arr :+ ""
        RowFactory.create(arr: _*)
    }
    val finalDF = spark.createDataFrame(sc.parallelize(dataArr), structSchema)

is causing the error.

I changed that part to:

    val dataArr = dataString.split("\\|\\#\\#\\|").map { s =>
        var arr = s.split("\\|\\*\\|")
        while (arr.length < columns.length) arr = arr :+ ""
        RowFactory.create(arr: _*)
    }.toList
    val finalDF = sqlContext.createDataFrame(sc.makeRDD(dataArr), structSchema)

But the error remains the same. What should I change to avoid this?

When I run this code on a Databricks Spark cluster, the job gives this Spark driver error:

Job aborted due to stage failure: Serialized task 45:0 was 792585456 bytes, which exceeds max allowed: spark.rpc.message.maxSize (268435456 bytes).

I added this line:

spark.conf.set("spark.rpc.message.maxSize",Int.MaxValue)

but it had no effect.
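As an aside, that setting could not have taken effect as written: spark.rpc.message.maxSize is specified in MiB (not bytes) and must be below 2048, so Int.MaxValue is out of range. It is also read when the driver starts, so it needs to go into the session builder (or, on Databricks, the cluster's Spark config) rather than a runtime spark.conf.set call. A minimal sketch, with 1024 as a hypothetical value:

```scala
// spark.rpc.message.maxSize is in MiB and must be < 2048; it is read at
// driver startup, so set it while building the session (or in the
// Databricks cluster's Spark config), not via spark.conf.set afterwards.
val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.rpc.message.maxSize", "1024")  // hypothetical: 1024 MiB
  .getOrCreate()
```

Even at the maximum, a 792 MB serialized task would still be rejected at the default-sized ceiling people usually hit next (driver memory), which is why the fix below targets the collect itself.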

My guess is that

var dataString = correctData.collect().map(s => s.getString(0)).mkString("").replace("\\\n","").replace("\\\r","")

is the problem, because you collect (almost) all of the data to the driver, i.e. to a single JVM.

Maybe this line runs, but subsequent operations on dataString will exceed your memory limits. You should not collect your data! Instead, work with distributed data structures such as DataFrame or RDD.

I think you could just omit the collect in the above line.
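Concretely, one way to keep everything distributed is to let Spark itself split records on the |##| terminator instead of concatenating the whole file into one driver-side string. A rough sketch, untested, which assumes the trailer row starts with "T|*|" (adjust to however your footer line is actually marked); columns and structSchema are taken from the question's code:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// read with |##| as the record separator, so each Row is one logical record
val records = spark.read
  .option("lineSep", "|##|")
  .text(filePath)

// drop the header and (assumed) trailer, split each record into fields,
// and right-pad short records to the schema width -- all on the executors
val rows = records
  .filter(!col("value").startsWith("H|*|") && !col("value").startsWith("T|*|"))
  .rdd
  .map { r =>
    val fields = r.getString(0).replace("\n", "").replace("\r", "").split("\\|\\*\\|", -1)
    Row.fromSeq(fields.padTo(columns.length, ""))
  }

val finalDF = spark.createDataFrame(rows, structSchema)
```

Nothing here passes through the driver, so neither the heap-space error nor the spark.rpc.message.maxSize limit on serialized tasks should be triggered.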
