
How to use the new Hadoop Parquet magic committer with a custom S3 server from Spark

I have Spark 2.4.0 and Hadoop 3.1.1. According to the Hadoop documentation, to use the new magic committer that allows Parquet files to be written to S3 consistently, I've set these values in conf/spark-defaults.conf:

spark.sql.sources.commitProtocolClass       com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class    org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a    org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name          magic
spark.hadoop.fs.s3a.committer.magic.enabled true

When using this configuration I end up with the exception:

java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol

My question is twofold. First, do I understand correctly that Hadoop 3.1.1 allows writing Parquet files to S3 consistently?
Second, if so, how do I use the new committer properly from Spark?

Edit:
OK, I have two server instances, one of which is a bit old now. I've attempted to use the latest version of MinIO with these parameters:

sc.hadoopConfiguration.set("hadoop.fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload.buffer","bytebuffer")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.multipart.size","128M")
sc.hadoopConfiguration.set("fs.s3a.fast.upload.active.blocks","4")
sc.hadoopConfiguration.set("fs.s3a.committer.name","partitioned")

I'm able to write so far without trouble.
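For context, the writes being tested are ordinary Parquet writes through the s3a connector. A minimal sketch is below; the bucket name and the `spark` session are placeholders rather than anything from the actual setup:

// Minimal sketch: a small Parquet write through s3a to exercise the committer.
// The bucket/path is a placeholder; `spark` is assumed to be the active SparkSession.
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

df.write
  .mode("overwrite")
  .parquet("s3a://some-test-bucket/committer-check/")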
However, my Swift server, which is a bit older and uses this config:

sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")

does not seem to properly support the partitioned committer.

Regarding "Hadoop S3guard":关于“Hadoop S3guard”:
It is not possible currently, Hadoop S3guard that keep metadata of the S3 files must be enable in Hadoop.目前不可能,必须在 Hadoop 中启用保存 S3 文件元数据的 Hadoop S3guard。 The S3guard though rely on DynamoDB a proprietary Amazon service. S3guard 虽然依赖于 DynamoDB 一项专有的亚马逊服务。
There's no alternative now like a sqlite file or other DB system to store the metadata.现在别无选择,例如 sqlite 文件或其他数据库系统来存储元数据。
So if you're using S3 with minio or any other S3 implementation, you're missing DynamoDB.因此,如果您将 S3 与minio或任何其他 S3 实现一起使用,那么您就缺少 DynamoDB。
This article explains nicely how works S3guard这篇文章很好地解释了 S3guard 的工作原理

Kiwy: that's my code: I can help you with this. Some of the classes haven't got into the ASF Spark releases, but you'll find them in the Hadoop JARs, and I could have a go at building the ASF release with the relevant dependencies in (I could put them in downstream; they used to be there).

You do not need S3Guard turned on to use the "staging committer"; it's only the "magic" variant which needs consistent object store listings during the commit phase.

All of the new committer configuration documentation I've read to date is missing one fundamental fact:

Spark 2.x.x does not have the support classes needed to make the new S3A committers function.

They promise that those cloud integration libs will be bundled with Spark 3.0.0, but for now you have to add the libraries yourself.

Under the cloud integration Maven repos there are multiple distributions supporting the committers; I found one that worked with the directory committer, but not the magic one.
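For what it's worth, once a distribution is picked the jars can be handed to Spark through spark-defaults.conf; this is only a sketch under that assumption, and the paths and jar names below are placeholders, not real coordinates:

# Hedged sketch: make the downloaded cloud-committer jars visible to Spark.
# The paths and jar names are placeholders and depend on the distribution you choose.
spark.jars    /path/to/spark-hadoop-cloud.jar,/path/to/hadoop-aws.jar,/path/to/aws-java-sdk-bundle.jar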

In general the directory committer is recommended over magic, as it has been well tested and tried. It requires a shared filesystem (the magic committer does not require one, but needs S3Guard), such as HDFS or NFS (we use AWS EFS), to coordinate Spark worker writes to S3.
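A hedged sketch of what the directory (staging) committer setup can look like in spark-defaults.conf, assuming the binding classes from the question are already on the classpath; the staging path is a placeholder and should point at the shared filesystem mentioned above:

# Hedged sketch: directory (staging) committer instead of magic.
spark.hadoop.fs.s3a.committer.name                     directory
# Conflict handling for existing output: fail, append or replace
spark.hadoop.fs.s3a.committer.staging.conflict-mode    replace
# Placeholder: a directory on the shared cluster filesystem (HDFS/NFS/EFS) for staged task output
spark.hadoop.fs.s3a.committer.staging.tmp.path         /tmp/staging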
