
spark wholeTextFiles fails for large data

I use pyspark version 1.5.0 with Cloudera 5.5.0. All scripts run fine except when I use sc.wholeTextFiles. Using this command gives an error:

Kryo Serialization failed: Buffer overflow. Available:0, required: 23205706. To avoid this, increase spark.kryoserializer.buffer.max
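For context, a minimal sketch of the kind of call that triggers this (the HDFS path is a placeholder and sc is the shell's pre-built SparkContext; neither is from the original question):

    # Assumes the pyspark shell, where a SparkContext named sc already exists.
    # The input path is hypothetical and stands in for the real directory of files.
    rdd = sc.wholeTextFiles("hdfs:///path/to/input_dir")  # RDD of (filename, content) pairs
    print(rdd.count())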

However, I can't find the property spark.kryoserializer.buffer.max in the Spark web UI; it is not present under the Environment tab. The only "kryo" on that page is the value org.apache.spark.serializer.KryoSerializer for the name spark.serializer.

Why can't I see this property? And how can I fix the problem?

EDIT

It turns out that the Kryo error was caused by printing to the shell. Without the printing, the error is actually java.io.IOException: Filesystem closed! The script now works correctly on a small portion of the data, but running it on all of the data (about 500GB, 10,000 files) returns this error.
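The printing in question was presumably something along these lines (a guess at the pattern, not the original code); pulling whole file contents back to the driver is what demands such a large Kryo buffer:

    # Hypothetical reconstruction of the "printing to the shell" step.
    # collect() serializes every (filename, content) pair and ships it to
    # the driver, which is where the Kryo buffer overflow surfaced.
    for filename, content in rdd.collect():
        print("%s: %d bytes" % (filename, len(content)))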

I tried to pass in the property --conf "spark.yarn.executor.memoryOverhead=2000", and it seems to let a slightly larger part of the data be read, but it still ultimately fails on the full data. It takes 10-15 minutes of running before the error appears.

The RDD is big, but the error is produced even when only calling .count() on it.

You should pass such a property when submitting the job; that is why it is not in the Cloudera UI. See http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html

In your case: --conf "spark.kryoserializer.buffer.max=64M" (for example).

Also, I'm not sure, but it might be that if you increase the Kryo buffer you will want to increase the akka frame size as well.
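For completeness, the same properties can be set programmatically on a SparkConf before the SparkContext is created; a minimal sketch, where both values are illustrative starting points rather than tuned recommendations:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .set("spark.kryoserializer.buffer.max", "64m")  # ceiling for one serialized object
            .set("spark.akka.frameSize", "128"))            # in MB; applies to Spark 1.x
    sc = SparkContext(conf=conf)

Note that only properties that were explicitly set appear under the Environment tab, which is why a property left at its default, like spark.kryoserializer.buffer.max here, was not listed.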
