
Spark: Read and Write to Parquet leads to OutOfMemoryError: Java heap space

I wrote some code to read a Parquet file, adjust the schema slightly, and write the data to a new Parquet file. The code looks as follows:

...
val schema = StructType(
  List(
    StructField("id", LongType, false),
    StructField("data", ArrayType(FloatType), false)
  )
)

val data = sqlContext.read.parquet(file.getAbsolutePath)
val revisedData = data.map(r => Row(r.getInt(0).toLong, r.getSeq[Float](1)))
val df = sqlContext.createDataFrame(revisedData, schema)

Writer.writeToParquet(df)

with Writer being:

object Writer {
  def writeToParquet(df: DataFrame): Unit = {
    val future = Future {
      df.write.mode(SaveMode.Append).save(path)
    }

    Await.ready(future, Duration.Inf)
  }
}

For a file of about 4 GB my program breaks with an OutOfMemoryError: Java heap space. I have set the executor memory to 6 GB (using -Dspark.executor.memory=6g), raised the JVM heap space (using -Xmx6g), and increased the Kryo serializer buffer to 2 GB (using System.setProperty("spark.kryoserializer.buffer.mb", "2048")). However, I still get the error.
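
For reference, the same settings can also be expressed directly on a SparkConf instead of mixing -D flags and System.setProperty. The following is only a sketch of the configuration described above, using the Spark 1.x property names from the question; the app name is a placeholder, and spark.driver.memory only takes effect if it is set before the driver JVM starts (e.g. via spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the configuration above, expressed on a SparkConf (Spark 1.x property names).
val conf = new SparkConf()
  .setAppName("parquet-rewrite")                  // placeholder app name
  .set("spark.executor.memory", "6g")             // instead of -Dspark.executor.memory=6g
  .set("spark.driver.memory", "6g")               // roughly what -Xmx6g was meant to achieve
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "2048")  // instead of System.setProperty(...)

val sc = new SparkContext(conf)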

This is the stack trace:

java.lang.OutOfMemoryError: Java heap space
  at com.esotericsoftware.kryo.io.Output.<init>(Output.java:35)
  at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:76)
  at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:243)
  at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:243)
  at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:247)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:236)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:744)

What can I do to avoid this error?

Following my comment, two things:

1) You need to watch out for the spark.kryoserializer.buffer.mb property name: in newer Spark versions it has been replaced by spark.kryoserializer.buffer and spark.kryoserializer.buffer.max.

2) You have to be careful with the size of the buffer relative to your heap size. The buffer has to be big enough to hold a single record you are writing, but not much more, because Kryo allocates an explicit byte[] of that size (and allocating a single 2 GB byte array is usually a bad idea). Try lowering your buffer size with the proper property; see the sketch below.
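
A minimal sketch of what that could look like with the newer property names (the buffer values here are illustrative, matching Spark's defaults, not a recommendation; pass the resulting conf to the SparkContext as usual):

import org.apache.spark.SparkConf

// Sketch only: newer Kryo property names, with buffers sized for single records rather than gigabytes.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "64k")      // initial per-task buffer (Spark's default)
  .set("spark.kryoserializer.buffer.max", "64m")  // must fit the largest single record (Spark's default)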

Using sparklyr, I ran into the same OutOfMemoryError: despite reducing spark.kryoserializer.buffer, I was not able to read back a Parquet file I had been able to write. My solution was to:

disable the "eager" memory load option (memory = FALSE):

spark_read_parquet(sc, name = curName, file.path("file://", srcFile), header = TRUE, memory = FALSE)

spark 2.3.0, sparklyr 1.0.0, R version 3.4.2

