
Spark saveAsTextFile writes empty file - <directory>_$folder$ to S3

rdd.saveAsTextFile("s3n://bucket-name/path) is creating an empty file with folder name as - [folder-name]_$folder$ Seems like this empty file in used by hadoop-aws jar (of org.apache.hadoop) to mimick S3 filesystem as hadoop filesystem. rdd.saveAsTextFile("s3n://bucket-name/path)正在创建一个空文件,其文件夹名称为- [folder-name]_$folder$似乎是hadoop-aws jar (of org.apache.hadoop)使用的空文件hadoop-aws jar (of org.apache.hadoop)将S3文件系统模仿为hadoop文件系统。

But my application writes thousands of files to S3. Since saveAsTextFile creates a folder (from the given path) to write the data (from the RDD), my application ends up creating thousands of these empty [directory-name]_$folder$ files.

Is there a way to make rdd.saveAsTextFile not write these empty files?

Stop using s3n and switch to s3a. It's faster and actually supported. That will make this issue go away, along with the atrocious performance problems when reading large Parquet/ORC files.
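A minimal sketch of the same write through s3a, assuming the hadoop-aws module and its AWS SDK dependency are on the classpath; the credential values, bucket name and path are placeholders (credentials can also come from the environment or an instance profile):

import org.apache.spark.{SparkConf, SparkContext}

object S3aWriteExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("s3a-write")
      // spark.hadoop.* settings are forwarded to the Hadoop configuration.
      .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
      .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(Seq("a", "b", "c"))

    // Same call as before, but through the s3a:// scheme,
    // which does not leave _$folder$ marker objects behind.
    rdd.saveAsTextFile("s3a://bucket-name/path")

    sc.stop()
  }
}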

Also, if your app is creating thousands of small files in S3, you are creating future performance problems: listing and opening files on S3 is slow. Try to combine the source data into larger, columnar-formatted files and use whatever SELECT mechanism your framework has to read only the bits you want.
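A minimal sketch of that consolidation, assuming Spark SQL is available; the input path, output path and column names are hypothetical:

import org.apache.spark.sql.SparkSession

object ConsolidateToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("consolidate-to-parquet").getOrCreate()

    // Hypothetical source: many small JSON files under one prefix.
    val df = spark.read.json("s3a://bucket-name/raw/")

    // Reduce the number of output files before writing columnar data.
    df.coalesce(16)
      .write
      .mode("overwrite")
      .parquet("s3a://bucket-name/consolidated/")

    // Later reads can prune columns and push down filters
    // instead of listing and scanning thousands of small objects.
    val subset = spark.read
      .parquet("s3a://bucket-name/consolidated/")
      .select("id", "value")
      .filter("value > 100")
    subset.show()

    spark.stop()
  }
}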

