
Set Spark configuration in AWS Glue PySpark

I am using AWS Glue with PySpark and want to add a couple of configurations to the SparkSession, e.g.:

"spark.hadoop.fs.s3a.impl" = "org.apache.hadoop.fs.s3a.S3AFileSystem"
"spark.hadoop.fs.s3a.multiobjectdelete.enable" = "false"
"spark.serializer" = "org.apache.spark.serializer.KryoSerializer"
"spark.hadoop.fs.s3a.fast.upload" = "true"

The code I am using to initialise the context is the following:

from awsglue.context import GlueContext
from pyspark.context import SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

From what I understood from the documentation, these configurations should be added as job parameters when submitting the Glue job. Is that the case, or can they also be added when initializing the Spark session?
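To make the second option concrete, this is roughly what I have in mind (an untested sketch; since SparkContext.getOrCreate reuses an existing context, the conf would only be applied if Glue has not already created one):

from awsglue.context import GlueContext
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

# Build the desired settings up front and hand them to the context on creation
conf = SparkConf()
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
conf.set("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.hadoop.fs.s3a.fast.upload", "true")

sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session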

This doesn't seem to error out, but I'm not sure whether it's actually taking effect:

# Hadoop-level options are set on hadoopConfiguration() directly; the keys
# should not carry the "spark.hadoop." prefix there (that prefix is only
# stripped when the options come in through SparkConf)
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.maxRetries", "20")
hadoop_conf.set("fs.s3.consistent.retryPolicyType", "exponential")
