
Use Spark fileoutputcommitter.algorithm.version=2 with AWS Glue

I haven't been able to figure this out, but I'm trying to use a direct output committer with AWS Glue:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

Is it possible to use this configuration with AWS Glue?

Option 1:

Glue uses a Spark context, so you can set Hadoop configuration for AWS Glue as well, since internally a DynamicFrame is a kind of DataFrame.

sc._jsc.hadoopConfiguration().set("mykey","myvalue")

I think you also need to add the corresponding committer class, like this:

sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter")

Example snippet:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")

glueContext = GlueContext(sc)
spark = glueContext.spark_session
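
For context, here is a minimal sketch of a write that benefits from this setting (the bucket and path are hypothetical). With algorithm version 2, each task moves its output to the final location at task commit, instead of the driver renaming everything in one serial step at job commit, which is what makes S3 writes faster.

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
# hypothetical output path; with version 2 the rename-heavy job-commit phase is avoided
df.write.mode("overwrite").parquet("s3://my-bucket/output/")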

To prove that the configuration exists:

Debug in Python:

sc._conf.getAll()  # print this
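
Note that sc._conf.getAll() only lists the Spark configuration. To confirm the Hadoop-level key set above, you can also read it back from the Hadoop configuration directly (a sketch):

sc._jsc.hadoopConfiguration().get("mapreduce.fileoutputcommitter.algorithm.version")  # should return '2'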

Debug in Scala:

sc.getConf.getAll.foreach(println)

Option 2:

Alternatively, you can try using the job parameters of the Glue job:

https://docs.aws.amazon.com/glue/latest/dg/add-job.html, which takes key-value properties as mentioned in the docs:

'--myKey' : 'value-for-myKey'  
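
If you pass a custom parameter this way, the usual pattern for reading it back inside the Glue script is getResolvedOptions; a minimal sketch using the --myKey example above:

import sys
from awsglue.utils import getResolvedOptions

# getResolvedOptions strips the leading dashes, so the key is looked up as "myKey"
args = getResolvedOptions(sys.argv, ["myKey"])
print(args["myKey"])  # 'value-for-myKey'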

You can follow the screenshot below to edit the job and specify the parameters with --conf.

[Screenshot: Glue console, editing the job parameters with --conf]

Option 3:

If you are using the AWS CLI, you can try the approach below: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
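
Since the rest of this answer uses Python, here is a boto3 equivalent of that CLI approach as a sketch; the job name, role, and script location are hypothetical placeholders, and DefaultArguments is where the --conf key-value pair goes:

import boto3

glue = boto3.client("glue")

# Sketch: create a job whose default arguments carry the committer setting.
# Name, Role and ScriptLocation are hypothetical placeholders.
glue.create_job(
    Name="my-etl-job",
    Role="MyGlueServiceRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/my_job.py"},
    DefaultArguments={"--conf": "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2"},
)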

The funny part is that the docs tell you not to set this parameter, as shown below, but I don't know why it is exposed then.

[Screenshot: the docs note saying not to set --conf]

To sum up: I personally prefer Option 1, since you have programmatic control.

Go to the Glue job console and edit your job as follows:

Glue > Jobs > Edit your Job > Script libraries and job parameters (optional) > Job parameters

Set the following:

Key: --conf
Value:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
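
Once the job runs with that parameter, a quick sanity check from inside the script (a sketch, assuming the usual spark session obtained from GlueContext) is to read the value back from the Spark conf:

print(spark.sparkContext.getConf().get(
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"))  # expect '2'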

