
Set spark configuration in AWS Glue pySpark

I am using AWS Glue with pySpark and want to add a couple of configurations to the sparkSession, e.g. "spark.hadoop.fs.s3a.impl" = "org.apache.hadoop.fs.s3a.S3AFileSystem", "spark.hadoop.fs.s3a.multiobjectdelete.enable" = "false", "spark.serializer" = "org.apache.spark.serializer.KryoSerializer", "spark.hadoop.fs.s3a.fast.upload" = "true". The code I am using to initialise the context is the following:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

From what I understood from the documentation, I should add these configs as job parameters when submitting the Glue job. Is that the case, or can they also be added when initializing the Spark session?
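
For reference, this is roughly what I had in mind for setting them at initialization. It is only a sketch using the standard pySpark SparkConf API; I have not confirmed that Glue picks these values up when the context is created this way:

from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.conf import SparkConf

# Sketch: collect the settings in a SparkConf and hand the resulting
# SparkContext to GlueContext (unverified inside a Glue job).
conf = SparkConf()
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
conf.set("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.hadoop.fs.s3a.fast.upload", "true")

glueContext = GlueContext(SparkContext.getOrCreate(conf=conf))
spark = glueContext.spark_session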

I also tried the following, which doesn't seem to be erroring out, but I'm not sure whether it actually takes effect:

# Setting the values directly on the underlying Hadoop configuration
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("spark.hadoop.fs.s3.maxRetries", "20")
hadoop_conf.set("spark.hadoop.fs.s3.consistent.retryPolicyType", "exponential")
