
spark wholeTextFiles fails for large data

I use PySpark 1.5.0 with Cloudera 5.5.0. All my scripts run fine except when I use sc.wholeTextFiles; that call gives this error:

Kryo Serialization failed: Buffer overflow. Available:0, required: 23205706. To avoid this, increase spark.kryoserializer.buffer.max

However, I can't find the property spark.kryoserializer.buffer.max anywhere: it is not listed under the Environment tab of the Spark web UI. The only "kryo" entry on that page is the value org.apache.spark.serializer.KryoSerializer for the key spark.serializer.

Why can't I see this property, and how do I fix the problem?

EDIT

Turns out the Kryo error was caused by printing to the shell. Without the printing, the error is actually java.io.IOException: Filesystem closed! The script now works correctly on a small portion of the data, but running it on all of the data (about 500 GB, 10,000 files) still fails with this error.

I tried passing the property --conf "spark.yarn.executor.memoryOverhead=2000", and it seems to let a slightly larger portion of the data be read, but it still ultimately fails on the full data. The error appears after 10-15 minutes of running.

The RDD is big, but the error is produced even when I only call .count() on it.
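
For reference, the failing code is essentially the pattern below. This is a minimal sketch, not the actual script; the app name and input path are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="wholeTextFilesRepro")  # hypothetical app name

    # wholeTextFiles returns (path, entire file content) pairs, so each whole
    # file is materialized as a single record and record size scales with file size.
    files = sc.wholeTextFiles("hdfs:///path/to/input/*")  # hypothetical path

    # Even a simple action forces the files to be read and serialized.
    print(files.count())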

You should pass such a property when submitting the job; that is why it does not show up in the Cloudera UI. See http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html

In your case: --conf "spark.kryoserializer.buffer.max=64m" (for example)
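
If you build the SparkContext yourself rather than relying only on spark-submit flags, the same settings can go into a SparkConf before the context is created. This is just a programmatic sketch of the equivalent configuration; the app name and values are illustrative, not tuned:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("wholeTextFilesJob")  # hypothetical name
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .set("spark.kryoserializer.buffer.max", "512m")       # raise the Kryo buffer ceiling
            .set("spark.yarn.executor.memoryOverhead", "2000"))   # MB of off-heap headroom per executor
    sc = SparkContext(conf=conf)

Passing the same keys with --conf at submit time, as above, has the same effect and leaves the script unchanged.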

Also, I'm not sure, but if you increase the Kryo buffer you might also need to increase the Akka frame size (spark.akka.frameSize).
