Spark encode in Gzip and send to S3 - java.io.IOException: No space left on device
I'm trying to GZIP and send an RDD over to S3 like so:
import org.apache.hadoop.io.compress.GzipCodec

dwPartitioned.saveAsTextFile(s"s3n://$accessKey:$secretKey@bucket", classOf[GzipCodec])
The job starts running and shortly after comes up with:
org.apache.spark.SparkException: Job aborted due to stage failure: ... : java.io.IOException: No space left on device
I read that because of the encoding there is some shuffling done which requires temporary files to be generated. Is that true? Am I misusing the functionality? Is there something that I can optimize here?
More importantly - how can I achieve this in memory?
If you need more info I'll gladly append it.
By default, Spark uses "/tmp" to save intermediate files. While the job is running, you can run "df -h" and watch the used space of the filesystem mounted at "/" grow.
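The same check can also be done from Scala; a minimal sketch (assuming the default "/tmp" scratch location mentioned above):

import java.io.File

// Report the usable space (in MB) on the filesystem holding Spark's
// scratch directory; "/tmp" here assumes the default spark.local.dir.
val freeMb = new File("/tmp").getUsableSpace / (1024L * 1024L)
println(s"Usable space under /tmp: $freeMb MB")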
When the device runs out of space, this exception is thrown. To solve the problem, set spark.local.dir in SPARK_HOME/conf/spark-defaults.conf (or the SPARK_LOCAL_DIRS environment variable in conf/spark-env.sh) to a path on a filesystem with enough free space.
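For example (a minimal sketch, not the asker's actual setup), add a line like "spark.local.dir /mnt/large-disk/spark-tmp" to spark-defaults.conf, where /mnt/large-disk/spark-tmp is a hypothetical mount with plenty of free space. The same property can be set from code, though an explicitly exported SPARK_LOCAL_DIRS would take precedence:

import org.apache.spark.{SparkConf, SparkContext}

// Point Spark's scratch space at a roomier filesystem.
// "/mnt/large-disk/spark-tmp" is a made-up placeholder; substitute your own.
val conf = new SparkConf()
  .setAppName("gzip-to-s3")
  .set("spark.local.dir", "/mnt/large-disk/spark-tmp")
val sc = new SparkContext(conf)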