
Does saving a Spark DF to an S3 path write data to an EBS volume first?

I am curious to know what happens behind the scenes when writing a Spark DF as a Parquet file to an S3 location. Does it first store the data on the local file system (EBS in our case) and then push those files to S3? Once the local files are successfully pushed, are the local files on the EBS volume deleted?

If it stores the data locally on an EBS volume, what path does it choose for the local files? Which Spark configuration property sets this path?

Sample code that saves the DF to the S3 location:

    # prtn_col holds the partition column names; the output format comes from config
    df.repartition(prtn_col[0]) \
        .write.format(self.config_dict['output_format']) \
        .mode('overwrite') \
        .partitionBy(*prtn_col) \
        .save(path)

Software versions used

  1. Python version: 3.7.0
  2. PySpark version: 2.4.7
  3. EMR: 5.32.0

Please let me know if you would like me to share any other info that will help answer this question.

If you are using S3 URIs that look like s3://bucket/path on EMR (as opposed to s3a://bucket/path), you are using a component called EMRFS, EMR's closed-source implementation of the S3 filesystem connector. When writing files to S3 (including via Spark DataFrame write operations), EMRFS does write the data first to temporary files on the local filesystem. And yes, the temporary files are deleted upon completing the upload (or at some point in the future, such as when the YARN container running the Spark executor is terminated, if the S3 upload does not complete normally).
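As an illustration of the configuration side of the question: on EMR, the EMRFS buffer location is commonly controlled by the Hadoop property `fs.s3.buffer.dir` (set in core-site, defaulting to directories under the instance's mount points such as /mnt). The classification and directory paths below are assumptions to verify against your EMR release notes, not values taken from this cluster. A minimal sketch of overriding it with an EMR configuration classification:

```
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3.buffer.dir": "/mnt/s3-buffer,/mnt1/s3-buffer"
    }
  }
]
```

The same property can also be passed per job via Spark's Hadoop-config passthrough, e.g. `--conf spark.hadoop.fs.s3.buffer.dir=...` on spark-submit. Listing several comma-separated directories (one per attached volume) lets EMRFS spread the temporary write buffers across disks.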

The temporary files might be written to EBS, but that depends on what types of volumes are attached to the cluster nodes performing the writes. Many instance types are "EBS-only", meaning that they have no storage of their own (such as local SSDs) and require you to attach EBS volumes. (If left unspecified, EMR will also automatically add a certain number of EBS volumes by default to EBS-only instance types.)
