Spark encode in Gzip and send to S3 - java.io.IOException: No space left on device
I'm trying to GZIP and send an RDD over to S3 like so:
import org.apache.hadoop.io.compress.GzipCodec

dwPartitioned.saveAsTextFile(s"s3n://$accessKey:$secretKey@bucket", classOf[GzipCodec])
The job starts running and shortly after comes up with:
org.apache.spark.SparkException: Job aborted due to stage failure: ... : java.io.IOException: No space left on device
I read that because of the encoding there is some shuffling done which requires temporary files to be generated. Is that true? Am I misusing the functionality? Is there something that I can optimize here?
More importantly - how can I achieve this in memory?
If you need more info I'll gladly append it.
By default, Spark uses "/tmp" to save intermediate files. While the job is running, you can run "df -h" and watch the used space of the filesystem mounted at "/" grow.
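The same check can also be done from Scala; a minimal sketch (assuming the default "/tmp" scratch location mentioned above):

import java.io.File

// Report the usable space (in MB) on the filesystem holding Spark's
// scratch directory; "/tmp" here assumes the default spark.local.dir.
val freeMb = new File("/tmp").getUsableSpace / (1024L * 1024L)
println(s"Usable space under /tmp: $freeMb MB")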
When the device runs out of space, this exception is thrown. To solve the problem, set spark.local.dir in SPARK_HOME/conf/spark-defaults.conf (or the SPARK_LOCAL_DIRS environment variable in conf/spark-env.sh) to a path on a filesystem with enough free space.
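For example (a minimal sketch, not the asker's actual setup), add a line like "spark.local.dir /mnt/large-disk/spark-tmp" to spark-defaults.conf, where /mnt/large-disk/spark-tmp is a hypothetical mount with plenty of free space. The same property can be set from code, though an explicitly exported SPARK_LOCAL_DIRS would take precedence:

import org.apache.spark.{SparkConf, SparkContext}

// Point Spark's scratch space at a roomier filesystem.
// "/mnt/large-disk/spark-tmp" is a made-up placeholder; substitute your own.
val conf = new SparkConf()
  .setAppName("gzip-to-s3")
  .set("spark.local.dir", "/mnt/large-disk/spark-tmp")
val sc = new SparkContext(conf)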