
spark wholeTextFiles fails for large data

I use pyspark version 1.5.0 with Cloudera 5.5.0. All scripts run fine except when I use sc.wholeTextFiles. Using this command gives an error:

Kryo Serialization failed: Buffer overflow. Available:0, required: 23205706. To avoid this, increase spark.kryoserializer.buffer.max
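For context, a minimal sketch of the kind of call that triggers this (the HDFS path is a placeholder and sc is the shell's pre-built SparkContext; neither is from the original question):

    # Assumes the pyspark shell, where a SparkContext named sc already exists.
    # The input path is hypothetical and stands in for the real directory of files.
    rdd = sc.wholeTextFiles("hdfs:///path/to/input_dir")  # RDD of (filename, content) pairs
    print(rdd.count())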

However, I can't find the property spark.kryoserializer.buffer.max in the Spark web UI; it is not present under the Environment tab. The only "kryo" on that page is the value org.apache.spark.serializer.KryoSerializer for the name spark.serializer.

Why can't I see this property? And how can I fix the problem?

EDIT

It turns out that the Kryo error was caused by printing to the shell. Without the printing, the error is actually java.io.IOException: Filesystem closed! The script now works correctly on a small portion of the data, but running it on all of the data (about 500GB, 10,000 files) returns this error.
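The printing in question was presumably something along these lines (a guess at the pattern, not the original code); pulling whole file contents back to the driver is what demands such a large Kryo buffer:

    # Hypothetical reconstruction of the "printing to the shell" step.
    # collect() serializes every (filename, content) pair and ships it to
    # the driver, which is where the Kryo buffer overflow surfaced.
    for filename, content in rdd.collect():
        print("%s: %d bytes" % (filename, len(content)))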

I tried to pass in the property --conf "spark.yarn.executor.memoryOverhead=2000", and it seems to let a slightly larger part of the data be read, but it still ultimately fails on the full data. It takes 10-15 minutes of running before the error appears.

The RDD is big, but the error is produced even when only calling .count() on it.

You should pass such a property when submitting the job; that is why it is not in the Cloudera UI. See http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html

In your case: --conf "spark.kryoserializer.buffer.max=64M" (for example).

Also, I'm not sure, but it might be that if you increase the Kryo buffer you will want to increase the akka frame size as well.
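For completeness, the same properties can be set programmatically on a SparkConf before the SparkContext is created; a minimal sketch, where both values are illustrative starting points rather than tuned recommendations:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .set("spark.kryoserializer.buffer.max", "64m")  # ceiling for one serialized object
            .set("spark.akka.frameSize", "128"))            # in MB; applies to Spark 1.x
    sc = SparkContext(conf=conf)

Note that only properties that were explicitly set appear under the Environment tab, which is why a property left at its default, like spark.kryoserializer.buffer.max here, was not listed.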
