
Use Spark fileoutputcommitter.algorithm.version=2 with AWS Glue

I haven't been able to figure this out, but I'm trying to use a direct output committer with AWS Glue:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

Is it possible to use this configuration with AWS Glue?

Option 1:

Glue uses a Spark context, so you can set Hadoop configuration for AWS Glue as well, since internally a DynamicFrame is a kind of DataFrame.

sc._jsc.hadoopConfiguration().set("mykey","myvalue")

I think you also need to add the corresponding committer class, like this:

sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter")

Example snippet:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")

glueContext = GlueContext(sc)
spark = glueContext.spark_session
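
For context, here is a minimal sketch of a write that benefits from this setting (the bucket and path are hypothetical). With algorithm version 2, each task moves its output to the final location at task commit, instead of the driver renaming everything in one serial step at job commit, which is what makes S3 writes faster.

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
# hypothetical output path; with version 2 the rename-heavy job-commit phase is avoided
df.write.mode("overwrite").parquet("s3://my-bucket/output/")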

To prove that the configuration exists:

Debug in Python:

sc._conf.getAll()  # print this
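
Note that sc._conf.getAll() only lists the Spark configuration. To confirm the Hadoop-level key set above, you can also read it back from the Hadoop configuration directly (a sketch):

sc._jsc.hadoopConfiguration().get("mapreduce.fileoutputcommitter.algorithm.version")  # should return '2'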

Debug in Scala:

sc.getConf.getAll.foreach(println)

Option 2:

Alternatively, you can try using the job parameters of the Glue job:

https://docs.aws.amazon.com/glue/latest/dg/add-job.html, which takes key-value properties as mentioned in the docs:

'--myKey' : 'value-for-myKey'  
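
If you pass a custom parameter this way, the usual pattern for reading it back inside the Glue script is getResolvedOptions; a minimal sketch using the --myKey example above:

import sys
from awsglue.utils import getResolvedOptions

# getResolvedOptions strips the leading dashes, so the key is looked up as "myKey"
args = getResolvedOptions(sys.argv, ["myKey"])
print(args["myKey"])  # 'value-for-myKey'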

You can follow the screenshot below to edit the job and specify the parameters with --conf.

[Screenshot: Glue console, editing the job parameters with --conf]

Option 3:

If you are using the AWS CLI, you can try the approach below: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
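
Since the rest of this answer uses Python, here is a boto3 equivalent of that CLI approach as a sketch; the job name, role, and script location are hypothetical placeholders, and DefaultArguments is where the --conf key-value pair goes:

import boto3

glue = boto3.client("glue")

# Sketch: create a job whose default arguments carry the committer setting.
# Name, Role and ScriptLocation are hypothetical placeholders.
glue.create_job(
    Name="my-etl-job",
    Role="MyGlueServiceRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/my_job.py"},
    DefaultArguments={"--conf": "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2"},
)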

The funny part is that the docs tell you not to set this parameter, as shown below, but I don't know why it is exposed then.

[Screenshot: the docs note saying not to set --conf]

To sum up: I personally prefer Option 1, since you have programmatic control.

Go to the Glue job console and edit your job as follows:

Glue > Jobs > Edit your Job > Script libraries and job parameters (optional) > Job parameters

Set the following:

Key: --conf
Value:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
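
Once the job runs with that parameter, a quick sanity check from inside the script (a sketch, assuming the usual spark session obtained from GlueContext) is to read the value back from the Spark conf:

print(spark.sparkContext.getConf().get(
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"))  # expect '2'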

