Magic committer not improving performance in a Spark3+Yarn3+S3 setup
I am trying to enable the S3A magic committer for my Spark 3.3.0 application running on a Yarn (Hadoop 3.3.1) cluster, hoping to see performance improvements during S3 writes. IIUC, my Spark application is writing about 21 GB of data with 30 tasks in the corresponding Spark stage (see image below).
I have a server which hosts the Spark client. The Spark client submits the application to the Yarn cluster in client mode with PySpark.
I am using the following config (set via the PySpark Spark conf) to enable the committer:
"spark.sql.sources.commitProtocolClass": "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
"spark.sql.parquet.output.committer.class": "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter"
"spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a": "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
"spark.hadoop.fs.s3a.committer.name": "magic"
"spark.hadoop.fs.s3a.committer.magic.enabled": "true"
I also downloaded the spark-hadoop-cloud jar into the jars/ directory of the Spark home on the NodeManagers and on my Spark-client server.
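As an alternative to copying the jar into every Spark home by hand, spark-submit can resolve the cloud-committer bindings from Maven at submit time. A sketch, assuming a Scala 2.12 build of Spark 3.3.0 and an application file named my_app.py:

```shell
# Pull spark-hadoop-cloud (which provides PathOutputCommitProtocol) from
# Maven instead of copying the jar to each node; the artifact version
# must match the Spark build in use.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --packages org.apache.spark:spark-hadoop-cloud_2.12:3.3.0 \
  my_app.py
```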
I do see the warning

WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.

going away after I apply the above configs. Hence, I believe my configs are getting applied correctly.
I have read in multiple articles that this committer is expected to deliver a performance boost (e.g. this article claims a 57-77% reduction in time). Hence, I expected a significant reduction (from 39s) in the "duration" column of my "parquet" stage when using the above configs, but I am not seeing any improvement.
"spark.sql.sources.commitProtocolClass": "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol"
, my app fails with the error java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
."spark.sql.sources.commitProtocolClass": "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol"
时,我的应用程序失败并显示错误java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
I do see a PRE __magic/ entry if I run aws s3 ls <write-path> while the job is running.
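Another way to confirm which committer actually committed the job is to inspect the _SUCCESS marker in the output path: the S3A committers write a JSON summary there (including the committer name), whereas the classic FileOutputCommitter writes a zero-byte file. A sketch, with bucket and path as placeholders:

```shell
# A zero-byte _SUCCESS means FileOutputCommitter ran; a JSON document
# naming the committer means one of the S3A committers ran.
aws s3 cp s3://my-bucket/my-output-path/_SUCCESS - | head -c 400
```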
目录。 grab the latest spark+hadoop build you can get, there's always ongoing improvements, with hadoop 3.3.5 doing a big enhancement there.获取你可以获得的最新 spark+hadoop 构建,总是有持续的改进,hadoop 3.3.5 在那里做了很大的增强。
You should see performance improvements compared to the v1 committer, with commit speed O(files) rather than O(data). It is also correct, which the v1 algorithm doesn't offer on S3 (and which v2 doesn't offer anywhere).