
Spark I/O with S3

Reading this below from https://blog.duyet.net/2021/04/spark-kubernetes-performance-tuning.html:

I/O with S3

It takes longer to append data to an existing dataset and, in particular, all of the Spark jobs have finished but your command has not, because the driver node is moving the output files of tasks from the job temporary directory to the final destination one-by-one, which is slow with cloud storage (e.g. S3).

Enable this optimization: spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
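For concreteness, this is a sketch of how the blog's suggested setting would typically be applied in a Spark 3.x Scala job; the app name, bucket, and path are placeholders. (The answer below explains why you probably should NOT use algorithm version 2.)

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3-append-example")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    // Append to an existing dataset. With the default (v1) committer, the final
    // renames happen in the driver during job commit, which is the slowness
    // the blog is describing on S3.
    spark.range(1000).toDF("id")
      .write.mode("append").parquet("s3a://my-bucket/my-dataset/")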

  • I want to check whether the bold statement is true. I have never heard that the Spark driver writes files / controls writing to S3. Sure, it is not an HDFS cluster, and the Spark driver does necessarily do work when reading from S3. My understanding is that the executors write the data to its at-rest destination, or to Kafka, even when running Spark on AWS. But presumably I am wrong, or not?

  • If true, same for ADLS2?

The comment "I have faced the same issue, and I found It was quicker to write the content on a temporary HDFS directory and the move the content with a command such as s3-dist-cp to S3" is not what I am asking about.评论“我遇到了同样的问题,我发现将内容写入临时 HDFS 目录并使用诸如 s3-dist-cp 之类的命令将内容移动到 S3 更快”不是我要问的.

Whoever wrote that post does not fully understand the whole job-commit problem and is dangerously misleading.

  1. job: the whole execution of a query/RDD write.
  2. task attempts perform work on different processes, generating output locally. TAs may fail, so this is done in isolation.
  3. task commit promotes the work of a task attempt to that of the job. It MUST be atomic, so that if a task fails midway through (or worse, is partitioned from the Spark driver but keeps going), another task attempt may be executed and committed.
  4. job commit takes all the work of committed tasks (and nothing from uncommitted/failed tasks) and promotes it to the final directory.
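For orientation: these four phases map onto Hadoop's OutputCommitter API, which Spark drives for file output. A skeletal committer for illustration only; the class and its (empty) logic are invented here, real committers fill the methods in.

    import org.apache.hadoop.mapreduce.{JobContext, OutputCommitter, TaskAttemptContext}

    class SketchCommitter extends OutputCommitter {
      override def setupJob(ctx: JobContext): Unit = ()           // create job temp dirs
      override def setupTask(ctx: TaskAttemptContext): Unit = ()  // per-attempt workspace
      override def needsTaskCommit(ctx: TaskAttemptContext): Boolean = true
      override def commitTask(ctx: TaskAttemptContext): Unit = () // phase 3: must be atomic
      override def abortTask(ctx: TaskAttemptContext): Unit = ()  // discard attempt output
      override def commitJob(ctx: JobContext): Unit = ()          // phase 4: publish final dir
    }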

The v1 commit algorithm consists of:

  1. Task commit: rename the task attempt src tree to a job attempt dir (under _tmp/). Relies on dir rename being (1) atomic and (2) fast. Neither requirement is met on S3.
  2. Job commit: list all task attempt directory trees, then rename dirs/files to the destination, one by one. Expects listing and rename to be fast.
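A minimal sketch of v1's two steps against Hadoop's FileSystem API, heavily simplified (the real committer merges nested directory trees and handles retries):

    import org.apache.hadoop.fs.{FileSystem, Path}

    def v1TaskCommit(fs: FileSystem, taskAttemptDir: Path, jobAttemptDir: Path): Boolean =
      // One directory rename: O(1) and atomic on HDFS; on S3 it is a non-atomic,
      // O(files) copy-then-delete, so neither requirement holds.
      fs.rename(taskAttemptDir, new Path(jobAttemptDir, taskAttemptDir.getName))

    def v1JobCommit(fs: FileSystem, jobAttemptDir: Path, destDir: Path): Unit =
      // Driver-side, during job commit: list every committed task's tree and
      // rename the entries to the destination, one by one.
      for (task <- fs.listStatus(jobAttemptDir); entry <- fs.listStatus(task.getPath))
        fs.rename(entry.getPath, new Path(destDir, entry.getPath.getName))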

The v2 commit algorithm is:

  1. Task commit: list all the files in the task attempt src tree, and rename them one by one to the dest dir. This is not atomic, and does not meet Spark's requirements.
  2. Job commit: write a 0-byte _SUCCESS file.
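The same sketch for v2, again simplified and for illustration only; note how the file-by-file renames have moved into task commit:

    import org.apache.hadoop.fs.{FileSystem, Path}

    def v2TaskCommit(fs: FileSystem, taskAttemptDir: Path, destDir: Path): Unit =
      // Each file becomes visible in the destination as soon as it is renamed.
      // A task that dies half-way through leaves a partial, visible result:
      // non-atomic, so a second task attempt cannot safely replace the first.
      for (file <- fs.listStatus(taskAttemptDir))
        fs.rename(file.getPath, new Path(destDir, file.getPath.getName))

    def v2JobCommit(fs: FileSystem, destDir: Path): Unit =
      fs.create(new Path(destDir, "_SUCCESS")).close() // 0-byte marker only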

The blog author is correct in noting that v1 is slow in job commit. Its real issue is not performance, though; it is correctness, due to task commit not being atomic on S3.

However, v2 is incorrect everywhere, even on HDFS, because the v2 task commit is non-atomic. Which is why, even if faster, you shouldn't use it. Anywhere. Really.

For s3 then, if you want to write data into classic "directory tree" layouts:

  • ASF Spark/Hadoop releases: use the S3A committers built into recent hadoop-aws versions. Read the Hadoop docs to see how; typical Spark settings are sketched after this list.
  • EMR: use the EMR S3 committer.
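A hedged sketch of enabling the S3A "magic" committer from Spark. The exact keys and the binding classes (which come from the spark-hadoop-cloud module) vary by Spark/Hadoop version, so verify them against the hadoop-aws committer documentation before relying on this:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3a-committer-example")
      .config("spark.hadoop.fs.s3a.committer.name", "magic")
      .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
      // Bind Spark SQL's commit protocol to Hadoop's PathOutputCommitter classes:
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .getOrCreate()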

Both of these avoid renames by writing the files to the final destination as S3 multipart uploads, but not finishing the uploads until job commit. This makes job commit fast, as it is nothing but listing/loading the single manifest file created by each task attempt (which lists its incomplete uploads), then POSTing the completions. No renames, and task commit is a PUT of a JSON file: fast and atomic.
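The underlying S3 trick, shown directly with the AWS SDK v2; bucket, key, and the single-part upload are placeholders, and the real committers manage many parts and persist the pending-upload manifest between the task and driver processes:

    import software.amazon.awssdk.core.sync.RequestBody
    import software.amazon.awssdk.services.s3.S3Client
    import software.amazon.awssdk.services.s3.model._

    val s3 = S3Client.create()
    val (bucket, key) = ("my-bucket", "dataset/part-00000.parquet")
    val data = "file contents".getBytes("UTF-8")

    // Task attempt: start the upload and push the data. Nothing is visible at `key` yet.
    val uploadId = s3.createMultipartUpload(
      CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build()).uploadId()
    val etag = s3.uploadPart(
      UploadPartRequest.builder().bucket(bucket).key(key)
        .uploadId(uploadId).partNumber(1).build(),
      RequestBody.fromBytes(data)).eTag()
    // Task commit: PUT a small JSON manifest recording (key, uploadId, etag) -- atomic.

    // Job commit: one POST per file completes the upload; the object appears atomically.
    s3.completeMultipartUpload(
      CompleteMultipartUploadRequest.builder().bucket(bucket).key(key).uploadId(uploadId)
        .multipartUpload(CompletedMultipartUpload.builder()
          .parts(CompletedPart.builder().partNumber(1).eTag(etag).build())
          .build())
        .build())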

If true, same for ADLS2?

v1 works there, though as listing is slow and rename not great, it is a bit slower than on HDFS. It can throttle under the load of a job commit, with the odd "quirky" failure wherein renames are reported as 503/throttled but did in fact take place... this complicates recovery.

Hadoop 3.3.5+ adds an Intermediate Manifest committer for performance on Azure and Google GCS. It also commits work by writing a manifest in task commit. Job commit is a parallelised list/load of these manifests, then a parallelised rename. View it as a v3 commit algorithm.

  • GCS: task commit becomes atomic and fast (GCS dir rename is non-atomic and O(files), which is why v1 is unsafe there).
  • ABFS: does the listing in task commit, so saving it from job commit (time, IOPs); rename is parallelised yet rate-limited (scale); and, by recording the etags of the source files, it is capable of recovering from throttle-related rename failures being misreported (i.e. if the dest file exists and etag == source etag, all is good).
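Rough shape of that "v3" algorithm, with invented helper types; the real manifest committer adds parallelism, etag tracking, and the recovery described above, all omitted here:

    import org.apache.hadoop.fs.{FileSystem, Path}

    final case class TaskManifest(renames: Seq[(Path, Path)]) // (source, destination)

    def v3TaskCommit(fs: FileSystem, taskAttemptDir: Path, destDir: Path): TaskManifest =
      // The listing cost is paid here, once per task and in parallel across tasks;
      // the resulting manifest is then saved with a single atomic create/PUT.
      TaskManifest(fs.listStatus(taskAttemptDir).toSeq.map(f =>
        (f.getPath, new Path(destDir, f.getPath.getName))))

    def v3JobCommit(fs: FileSystem, manifests: Seq[TaskManifest]): Unit =
      // Load every task's manifest, then rename; the real committer
      // parallelises both the loading and the renames.
      for (m <- manifests; (src, dest) <- m.renames)
        fs.rename(src, dest)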

Finally, there are cloud-first formats: Iceberg, Delta Lake, Hudi. These commit jobs atomically by writing a single manifest file somewhere; query planning becomes the work of listing/loading the chain of manifest files, and so identifying the data files to process. These are broadly recognised by everyone who works on the problem of Spark/Hive cloud performance as the future. If you can use them, your life is better.
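A conceptual sketch of why those formats commit atomically: the commit reduces to a single compare-and-swap of a "current snapshot" pointer. All names here are invented for illustration; Iceberg, Delta Lake, and Hudi each differ in the details:

    import java.util.concurrent.atomic.AtomicReference

    final case class Snapshot(id: Long, manifestPath: String, parent: Option[Snapshot])

    // Stand-in for the catalog/log that guards the table's current snapshot.
    val current = new AtomicReference(Snapshot(0L, "s3a://bucket/meta/v0.json", None))

    def commitJob(newManifestPath: String): Boolean = {
      val base = current.get() // the snapshot the writer planned against
      val next = Snapshot(base.id + 1, newManifestPath, Some(base))
      current.compareAndSet(base, next) // atomic: the job is fully visible, or not at all
    }

    // Query planning walks current.get() -> parent -> ..., loading each manifest
    // to identify the data files to read.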

Further reading:

The whole mechanism for committing work to persistent storage in the presence of failures is a fascinating piece of distributed computing. If you read the Zero Rename Committer paper, the final chapter actually discusses where things still went wrong in production. It is a better read in hindsight than it was at the time. Everyone should document their production problems.
