

Can't get Spark to use the magic output committer for s3 with EMR

I'm trying to use the magic output committer, but whatever I do I get the default output committer:

INFO FileOutputCommitter: File Output Committer Algorithm version is 10
22/03/08 01:13:06 ERROR Application: Only 1 or 2 algorithm version is supported

This is how I know I'm using it, according to the Hadoop docs (the algorithm version is set to an invalid value on purpose, so the error above shows the classic FileOutputCommitter is still being picked up). What am I doing wrong? This is my relevant conf (using SparkConf()); I tried many other combinations:

  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "10")
  .set("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  .set("fs.s3a.committer.name", "magic")
  .set("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .set("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")

I do not have any other configuration relevant to that, neither in code nor in conf files (Hadoop or Spark); maybe I should? The paths I'm writing to start with s3://. Using Hadoop 3.2.1, Spark 3.0.0 and EMR 6.1.1.
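Per the Hadoop S3A committer docs, another way to confirm which committer actually ran is to inspect the `_SUCCESS` file the job leaves in the output directory: the classic FileOutputCommitter writes a zero-byte marker, while the S3A committers write a JSON summary that names the committer. A minimal sketch of that check (the `committer` field follows the S3A `SuccessData` format; the payloads below are illustrative, not real job output):

```python
import json

def committer_from_success(payload: bytes) -> str:
    """Classify a _SUCCESS marker downloaded from the output path.

    The classic FileOutputCommitter leaves a zero-byte file; the S3A
    committers write a JSON summary whose 'committer' field names the
    committer that ran.
    """
    if not payload.strip():
        return "FileOutputCommitter (classic)"
    info = json.loads(payload)
    return info.get("committer", "unknown")

# Illustrative payloads:
print(committer_from_success(b""))                      # classic committer
print(committer_from_success(b'{"committer": "magic"}'))  # S3A magic committer
```

Fetch the file with something like `aws s3 cp s3://<bucket>/<output>/_SUCCESS -` and feed the bytes to the helper.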

So after a lot of reading plus stevel's comment, I found what I needed. I'm now using the optimized output committer, which is built into EMR and used by default. The reason I didn't get it at first is that the AWS optimized committer only activates when it can. Before EMR 6.4.0 it worked only under certain conditions, but from 6.4.0 onward it works for every write type (text, CSV, Parquet) and for RDDs, DataFrames and Datasets. So I just needed to upgrade to EMR 6.4.0.
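For reference, the EMRFS S3-optimized committer is controlled by an EMR-specific Spark property. On EMR 6.4.0+ it is enabled by default, so this spark-submit sketch only makes the setting explicit (the class and jar names are placeholders):

```shell
# Hedged sketch: explicitly enable the EMRFS S3-optimized committer.
# On EMR >= 6.4.0 this property already defaults to true.
spark-submit \
  --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true \
  --class com.example.MyJob my-job.jar
```

Note that this committer engages for EMRFS (s3://) paths, which is why none of the S3A (s3a://) magic-committer settings above were taking effect.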

There was an improvement of 50-60 percent in execution time.

The optimized committer requirements.

