Spark I/O with S3
Reading this below from https://blog.duyet.net/2021/04/spark-kubernetes-performance-tuning.html
I/O with S3
It takes longer to append data to an existing dataset; in particular, all of the Spark jobs have finished but your command has not, because the driver node is moving the output files of tasks from the job temporary directory to the final destination one-by-one, which is slow with cloud storage (e.g. S3).
Enable this optimization: spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
I want to check whether the bold statement is true. I have never heard that the Spark driver writes files / controls writing with S3. Sure, S3 is not an HDFS cluster, and the Spark driver does necessarily do some work when reading from S3. My understanding is that the executors write the data to storage at rest, or to Kafka, even when running Spark on AWS. But presumably I am wrong, or not?
If true, same for ADLS2?
The comment "I have faced the same issue, and I found it was quicker to write the content to a temporary HDFS directory and then move the content to S3 with a command such as s3-dist-cp" is not what I am asking about.
Whoever wrote that post does not fully understand the whole job commit problem, and the post is dangerously misleading.
The v1 commit algorithm consists of: each task writing its output to a task attempt directory under the job's _temporary directory; task commit, which renames the task attempt directory into the job attempt directory (a single rename, atomic on HDFS); and job commit, in which the driver renames each committed task's files into the final destination directory, one by one.
The v2 commit algorithm is: task commit renames the task's files directly into the destination directory; job commit then only writes the _SUCCESS marker. The renames are spread across task commits rather than serialized in job commit, but the output of committed tasks becomes visible before the job completes.
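The rename choreography of the two algorithms can be simulated on a local filesystem. This is an illustrative toy, not the real Hadoop FileOutputCommitter code; the directory layout is simplified, and the function and path names are made up for the sketch:

```python
# Toy simulation of the v1/v2 commit algorithms' rename flows.
# NOT real Hadoop code: layout and names are simplified placeholders.
import os
import tempfile

def run_tasks(root, n_tasks=3):
    """Each 'task' writes its part file into its own task-attempt dir."""
    attempts = []
    for t in range(n_tasks):
        attempt = os.path.join(root, "_temporary", "attempt", f"task_{t}")
        os.makedirs(attempt)
        with open(os.path.join(attempt, f"part-0000{t}"), "w") as f:
            f.write(f"data from task {t}")
        attempts.append(attempt)
    return attempts

def commit_v1(root, attempts):
    dest = os.path.join(root, "dest")
    os.makedirs(dest, exist_ok=True)
    committed = []
    # task commit: one directory rename per task
    # (atomic on HDFS, but a slow copy+delete per file on S3)
    for a in attempts:
        c = os.path.join(root, "_temporary", "committed", os.path.basename(a))
        os.renames(a, c)
        committed.append(c)
    # job commit: the DRIVER serially renames every file into the
    # destination -- the slow step the blog post describes.
    for c in committed:
        for name in os.listdir(c):
            os.rename(os.path.join(c, name), os.path.join(dest, name))
    return sorted(os.listdir(dest))

def commit_v2(root, attempts):
    dest = os.path.join(root, "dest")
    os.makedirs(dest, exist_ok=True)
    # task commit: each task renames its files STRAIGHT into the
    # destination. Job commit is near-instant, but task commit is
    # non-atomic: partial output is visible before the job finishes.
    for a in attempts:
        for name in os.listdir(a):
            os.rename(os.path.join(a, name), os.path.join(dest, name))
    return sorted(os.listdir(dest))

if __name__ == "__main__":
    for commit in (commit_v1, commit_v2):
        with tempfile.TemporaryDirectory() as root:
            print(commit.__name__, commit(root, run_tasks(root)))
```

Both paths end with the same files in the destination; the difference is where the renames happen and what is visible mid-job.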
The blog author is correct in noting that v1 is slow in job commit. Its real issue is not performance, though; it is correctness, because task commit is not atomic on S3.
However, v2 is incorrect everywhere, even on HDFS, because the v2 task commit is non-atomic. Which is why, even if it is faster, you shouldn't use it. Anywhere. Really.
For S3 then, if you want to write data into classic "directory tree" layouts, use the S3A committers: the Staging committer or the Magic committer.
Both of these avoid renames by writing the files to the final destination as S3 multipart uploads, but not finishing the uploads until job commit. This makes job commit fast, as it is nothing but listing/loading the single manifest file created by each task attempt (which lists its incomplete uploads), then POSTing the completions. No renames, and each task commit is a PUT of a JSON file: fast and atomic.
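As a sketch of what this looks like in practice (option and class names taken from the Hadoop S3A committer documentation; the job script name is a placeholder, and your Spark build needs the spark-hadoop-cloud module on the classpath), enabling the magic committer might look like:

```shell
# Hypothetical spark-submit enabling the S3A magic committer.
# Verify the exact settings against your Hadoop/Spark release docs.
spark-submit \
  --conf spark.hadoop.fs.s3a.committer.name=magic \
  --conf spark.hadoop.fs.s3a.committer.magic.enabled=true \
  --conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
  --conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter \
  my_job.py
```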
If true, same for ADLS2?
v1 works there, though as listing is slow and rename not great, it is a bit slower than on HDFS. It can throttle under the load of a job commit, with the odd "quirky" failure wherein a rename is reported as a 503/throttle error but did in fact take place...this complicates recovery.
Hadoop 3.3.5+ adds an Intermediate Manifest committer for performance on Azure and Google GCS. These also commit work by writing a manifest in task commit. Job commit is a parallelized list/load of these manifests, then a parallelized rename. View it as a v3 commit algorithm.
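A hedged sketch of wiring the manifest committer up for ABFS, based on my reading of the Hadoop manifest committer documentation (the factory class and keys should be double-checked against your Hadoop 3.3.5+ release; `my_job.py` is a placeholder):

```shell
# Hypothetical spark-submit binding the manifest committer to abfs://
# paths. Confirm class names against the Hadoop docs for your version.
spark-submit \
  --conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.abfs=org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory \
  --conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
  --conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter \
  my_job.py
```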
Finally, there are the cloud-first formats: Iceberg, Delta Lake, Hudi. These commit jobs atomically by writing a single manifest file somewhere; query planning becomes the work of listing/loading the chain of manifest files, thereby identifying the data files to process. These are broadly recognised as the future by everyone who works on the problem of Spark/Hive cloud performance. If you can use them, your life is better.
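The single-manifest commit idea can be sketched generically. This is a toy model on a local filesystem, not any particular table format's actual layout (Iceberg, for example, uses a chain of manifests plus a catalog pointer); all names here are invented for the illustration:

```python
# Toy model of manifest-based atomic commit: writers stage data files
# freely; the commit is one atomic rename of a manifest listing them.
# Readers only trust files reachable from the manifest.
import json
import os
import tempfile

def commit_snapshot(table_dir, data_files):
    """Atomically publish a snapshot: write the manifest under a temp
    name, then rename it into place. Readers see either the old
    snapshot or the new one, never a half-written state."""
    manifest = os.path.join(table_dir, "manifest.json")
    tmp = manifest + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"files": data_files}, f)
    os.replace(tmp, manifest)  # atomic on POSIX filesystems

def planned_files(table_dir):
    """Query planning: load the manifest to find the data files to read."""
    with open(os.path.join(table_dir, "manifest.json")) as f:
        return json.load(f)["files"]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as table:
        # tasks may leave extra/failed files behind; only committed ones count
        for name in ("part-0.parquet", "part-1.parquet", "orphan.parquet"):
            open(os.path.join(table, name), "w").close()
        commit_snapshot(table, ["part-0.parquet", "part-1.parquet"])
        print(planned_files(table))  # the orphan file is invisible to readers
```

The point of the sketch: correctness comes from the one atomic manifest swap, not from renaming the data files themselves.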
Further reading:
The whole mechanism for committing work to persistent storage in the presence of failures is a fascinating piece of distributed computing. If you read the Zero Rename Committer paper, the final chapter actually discusses where things still went wrong in production. It is a better read in hindsight than it was at the time. Everyone should document their production problems.