
Overwrite S3 files using Spark

I have a use-case where, after performing a join between two datasets, I need to write each row to a separate file (updating the existing file) on S3. Does Spark support this?

If not, can we use an S3 client explicitly to write each entry to a new file in S3? Are there any side-effects I should be aware of?

It's not really about Spark: S3 doesn't support in-place updates, so you have to store the whole object at once.

In theory you could use multipart upload (MPU) to join multiple S3 object parts; however, MPU is intended to support uploads bigger than 5 GB, and the minimum part size is 5 MB.

Each job can create a new S3 object (example).
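For reference, creating a whole new object per record directly with an S3 client could look roughly like the sketch below. This is a minimal example assuming the AWS SDK for Java v2; the bucket name, key layout, and payload are placeholders, not anything from the question.

```scala
import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.PutObjectRequest

object PutSingleRecord {
  def main(args: Array[String]): Unit = {
    val s3 = S3Client.create() // region/credentials resolved from the default chain

    // PutObject always replaces the whole object; S3 has no partial update.
    val request = PutObjectRequest.builder()
      .bucket("my-bucket")          // placeholder bucket
      .key("records/row-42.json")   // placeholder key, one object per record
      .build()

    s3.putObject(request, RequestBody.fromString("""{"id": 42, "value": "updated"}"""))
    s3.close()
  }
}
```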

As I said in my comment, tons of small files in S3 are usually bad practice. That said, if you only have a limited number of records to write, there are different options.

Here are some examples:

  • use the DataFrameWriter with overwrite mode and partitionBy on a unique column (first sketch below)
  • use df.rdd.mapPartitions and write each record to S3 manually using the Hadoop S3 FileSystem (second sketch below)
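
As a rough illustration of the first option, here is a minimal Scala sketch. The dataset, column names, and the s3a:// output path are placeholder assumptions; partitionBy on a unique column yields one sub-directory (and one small file set) per value.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionedOverwrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioned-overwrite").getOrCreate()
    import spark.implicits._

    // Stand-in for the result of the join; "id" is assumed unique per row.
    val joined = Seq((1, "a"), (2, "b")).toDF("id", "value")

    joined.write
      .mode(SaveMode.Overwrite)           // replace whatever is already under the path
      .partitionBy("id")                  // one directory per id => effectively one file per record
      .parquet("s3a://my-bucket/output/") // placeholder bucket/path

    spark.stop()
  }
}
```

And a sketch of the second option. It uses foreachPartition rather than mapPartitions, since nothing needs to be returned; the bucket, key layout, and record format are again placeholders, and Hadoop/S3 credential configuration is glossed over.

```scala
import java.net.URI
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// `joined` is the same DataFrame as in the previous sketch.
joined.rdd.foreachPartition { rows =>
  // One FileSystem handle per partition, created on the executor.
  val fs = FileSystem.get(new URI("s3a://my-bucket"), new Configuration())
  rows.foreach { row =>
    val path = new Path(s"s3a://my-bucket/records/${row.getInt(0)}.txt") // placeholder layout
    val out  = fs.create(path, true) // overwrite = true replaces an existing object
    try out.write(row.getString(1).getBytes(StandardCharsets.UTF_8))
    finally out.close()
  }
}
```

Each fs.create call results in a separate PUT, so the "many small objects" caveat above still applies.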

Good luck.
