
Overwrite S3 files using Spark

I have a use-case where, after performing a join between two datasets, I need to write each row to a separate file (updating the existing file) on S3. Does Spark support this?

If not, can we use an S3 client explicitly to write each entry to a new file in S3? Are there any side-effects I should be aware of?

It's not really about Spark: S3 doesn't support in-place updates, so you have to store the whole object at once.

In theory you could use multipart upload (MPU) to join multiple S3 object parts; however, MPU is intended to support uploads bigger than 5 GB, and the minimum part size is 5 MB.

Each job can create a new S3 object (example).
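For reference, creating a whole new object per record directly with an S3 client could look roughly like the sketch below. This is a minimal example assuming the AWS SDK for Java v2; the bucket name, key layout, and payload are placeholders, not anything from the question.

```scala
import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.PutObjectRequest

object PutSingleRecord {
  def main(args: Array[String]): Unit = {
    val s3 = S3Client.create() // region/credentials resolved from the default chain

    // PutObject always replaces the whole object; S3 has no partial update.
    val request = PutObjectRequest.builder()
      .bucket("my-bucket")          // placeholder bucket
      .key("records/row-42.json")   // placeholder key, one object per record
      .build()

    s3.putObject(request, RequestBody.fromString("""{"id": 42, "value": "updated"}"""))
    s3.close()
  }
}
```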

As I said in my comment, tons of small files in S3 are usually bad practice. That said, if you only have a limited number of records to write, there are different options.

Here are some examples:

  • use the DataFrameWriter with overwrite mode and partitionBy on a unique column (first sketch below)
  • use df.rdd.mapPartitions and write each record to S3 manually using the Hadoop S3 FileSystem (second sketch below)
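
As a rough illustration of the first option, here is a minimal Scala sketch. The dataset, column names, and the s3a:// output path are placeholder assumptions; partitionBy on a unique column yields one sub-directory (and one small file set) per value.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionedOverwrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioned-overwrite").getOrCreate()
    import spark.implicits._

    // Stand-in for the result of the join; "id" is assumed unique per row.
    val joined = Seq((1, "a"), (2, "b")).toDF("id", "value")

    joined.write
      .mode(SaveMode.Overwrite)           // replace whatever is already under the path
      .partitionBy("id")                  // one directory per id => effectively one file per record
      .parquet("s3a://my-bucket/output/") // placeholder bucket/path

    spark.stop()
  }
}
```

And a sketch of the second option. It uses foreachPartition rather than mapPartitions, since nothing needs to be returned; the bucket, key layout, and record format are again placeholders, and Hadoop/S3 credential configuration is glossed over.

```scala
import java.net.URI
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// `joined` is the same DataFrame as in the previous sketch.
joined.rdd.foreachPartition { rows =>
  // One FileSystem handle per partition, created on the executor.
  val fs = FileSystem.get(new URI("s3a://my-bucket"), new Configuration())
  rows.foreach { row =>
    val path = new Path(s"s3a://my-bucket/records/${row.getInt(0)}.txt") // placeholder layout
    val out  = fs.create(path, true) // overwrite = true replaces an existing object
    try out.write(row.getString(1).getBytes(StandardCharsets.UTF_8))
    finally out.close()
  }
}
```

Each fs.create call results in a separate PUT, so the "many small objects" caveat above still applies.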

Good luck.
