Magic committer not improving performance in a Spark3+Yarn3+S3 setup
I am trying to enable the S3A magic committer for my Spark 3.3.0 application running on a Yarn (Hadoop 3.3.1) cluster, hoping to see performance improvements during S3 writes. IIUC, my Spark application is writing about 21 GB of data with 30 tasks in the corresponding Spark stage (see image below).
I have a server which hosts the Spark client. The Spark client submits the application to the Yarn cluster in client mode with PySpark.
I am using the following config (set via the PySpark Spark conf) to enable the committer:
"spark.sql.sources.commitProtocolClass": "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
"spark.sql.parquet.output.committer.class": "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter"
"spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a": "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
"spark.hadoop.fs.s3a.committer.name": "magic"
"spark.hadoop.fs.s3a.committer.magic.enabled": "true"
I also downloaded the spark-hadoop-cloud jar into the jars/ directory of the Spark home on the NodeManagers and on my Spark-client server.
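As an alternative to copying the jar into every Spark home by hand, spark-submit can resolve the cloud-committer bindings from Maven at submit time. A sketch, assuming a Scala 2.12 build of Spark 3.3.0 and an application file named my_app.py:

```shell
# Pull spark-hadoop-cloud (which provides PathOutputCommitProtocol) from
# Maven instead of copying the jar to each node; the artifact version
# must match the Spark build in use.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --packages org.apache.spark:spark-hadoop-cloud_2.12:3.3.0 \
  my_app.py
```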
I do see the warning

WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.

going away after I apply the above configs. Hence, I believe my configs are getting applied correctly.
I have read in multiple articles that this committer is expected to deliver a performance boost (e.g. this article claims a 57-77% reduction in time). Hence, I expected a significant reduction (from 39s) in the "duration" column of my "parquet" stage when using the above configs, but I am not seeing any improvement.
"spark.sql.sources.commitProtocolClass": "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol"
, my app fails with the error java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
."spark.sql.sources.commitProtocolClass": "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol"
时,我的应用程序失败并显示错误java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
I do see a PRE __magic/ entry if I run aws s3 ls <write-path> while the job is running.
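Another way to confirm which committer actually committed the job is to inspect the _SUCCESS marker in the output path: the S3A committers write a JSON summary there (including the committer name), whereas the classic FileOutputCommitter writes a zero-byte file. A sketch, with bucket and path as placeholders:

```shell
# A zero-byte _SUCCESS means FileOutputCommitter ran; a JSON document
# naming the committer means one of the S3A committers ran.
aws s3 cp s3://my-bucket/my-output-path/_SUCCESS - | head -c 400
```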
目录。 grab the latest spark+hadoop build you can get, there's always ongoing improvements, with hadoop 3.3.5 doing a big enhancement there.获取你可以获得的最新 spark+hadoop 构建,总是有持续的改进,hadoop 3.3.5 在那里做了很大的增强。
You should see performance improvements compared to the v1 committer, with commit speed O(files) rather than O(data). It is also correct, which the v1 algorithm doesn't offer on S3 (and which v2 doesn't offer anywhere).