
spark wholeTextFiles fails for large data

I use PySpark 1.5.0 with Cloudera 5.5.0. All my scripts run fine except when I use sc.wholeTextFiles; that call gives this error:

Kryo Serialization failed: Buffer overflow. Available:0, required: 23205706. To avoid this, increase spark.kryoserializer.buffer.max

However, I can't find the property spark.kryoserializer.buffer.max anywhere: it is not listed under the Environment tab of the Spark web UI. The only "kryo" entry on that page is the value org.apache.spark.serializer.KryoSerializer for the key spark.serializer.

Why can't I see this property, and how do I fix the problem?

EDIT

Turns out the Kryo error was caused by printing to the shell. Without the printing, the error is actually java.io.IOException: Filesystem closed! The script now works correctly on a small portion of the data, but running it on all of the data (about 500 GB, 10,000 files) still fails with this error.

I tried passing the property --conf "spark.yarn.executor.memoryOverhead=2000", and it seems to let a slightly larger portion of the data be read, but it still ultimately fails on the full data. The error appears after 10-15 minutes of running.

The RDD is big, but the error is produced even when I only call .count() on it.
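
For reference, the failing code is essentially the pattern below. This is a minimal sketch, not the actual script; the app name and input path are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="wholeTextFilesRepro")  # hypothetical app name

    # wholeTextFiles returns (path, entire file content) pairs, so each whole
    # file is materialized as a single record and record size scales with file size.
    files = sc.wholeTextFiles("hdfs:///path/to/input/*")  # hypothetical path

    # Even a simple action forces the files to be read and serialized.
    print(files.count())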

You should pass such a property when submitting the job; that is why it does not show up in the Cloudera UI. See http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html

In your case: --conf "spark.kryoserializer.buffer.max=64m" (for example)
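
If you build the SparkContext yourself rather than relying only on spark-submit flags, the same settings can go into a SparkConf before the context is created. This is just a programmatic sketch of the equivalent configuration; the app name and values are illustrative, not tuned:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("wholeTextFilesJob")  # hypothetical name
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .set("spark.kryoserializer.buffer.max", "512m")       # raise the Kryo buffer ceiling
            .set("spark.yarn.executor.memoryOverhead", "2000"))   # MB of off-heap headroom per executor
    sc = SparkContext(conf=conf)

Passing the same keys with --conf at submit time, as above, has the same effect and leaves the script unchanged.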

Also, I'm not sure, but if you increase the Kryo buffer you might also need to increase the Akka frame size (spark.akka.frameSize).
