
Setting S3 Bucket permissions when writing between 2 AWS Accounts while running from Glue

I have a Scala jar which I am calling from an AWS Glue job. My jar writes a DataFrame to an S3 bucket in another AWS account which has KMS encryption turned on. I am able to write to the bucket, but I am not able to grant the destination bucket owner permission to access the files. I can achieve this if I simply use the Glue writer, but with straight Spark it just does not work. I have read all the documentation and I am setting the following properties in the Hadoop configuration.

def writeDataFrameInTargetLocation(sparkContext: SparkContext = null, dataFrame: DataFrame, location: String, fileFormat: String, saveMode: String, encryptionKey: Option[String] = Option.empty, kms_region: Option[String] = Option("us-west-2")): Unit = {

  if (encryptionKey.isDefined) {
    val region = if (kms_region.isDefined) kms_region.getOrElse("us-west-2") else "us-west-2"

    sparkContext.hadoopConfiguration.set("fs.s3.enableServerSideEncryption", "false")
    sparkContext.hadoopConfiguration.set("fs.s3.cse.enabled", "true")
    sparkContext.hadoopConfiguration.set("fs.s3.cse.encryptionMaterialsProvider", "com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider")
    sparkContext.hadoopConfiguration.set("fs.s3.cse.kms.keyId", encryptionKey.get) // KMS key to encrypt the data with
      sparkContext.hadoopConfiguration.set("fs.s3.cse.kms.region", region) // the region for the KMS key
    sparkContext.hadoopConfiguration.set("fs.s3.canned.acl", "BucketOwnerFullControl")
    sparkContext.hadoopConfiguration.set("fs.s3.acl.default", "BucketOwnerFullControl")
    sparkContext.hadoopConfiguration.set("fs.s3.acl", "bucket-owner-full-control")
    sparkContext.hadoopConfiguration.set("fs.s3.acl", "BucketOwnerFullControl")
  }
  else {
    sparkContext.hadoopConfiguration.set("fs.s3.canned.acl", "BucketOwnerFullControl")
    sparkContext.hadoopConfiguration.set("fs.s3.acl.default", "BucketOwnerFullControl")
    sparkContext.hadoopConfiguration.set("fs.s3.acl", "bucket-owner-full-control")
    sparkContext.hadoopConfiguration.set("fs.s3.acl", "BucketOwnerFullControl")
  }

    val writeDF = dataFrame
      .repartition(5)
      .write

    
      writeDF
        .mode(saveMode)
        .option("header", "true")
        .format(fileFormat)
        .save(location)
    }
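
As a point of comparison, the Glue-writer path referred to above might look roughly like the following sketch. This is illustrative only, assuming the standard AWS Glue Scala classes (GlueContext, DynamicFrame, JsonOptions); the path and format are placeholders.

import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.sql.DataFrame

// Sketch: wrap the Spark DataFrame in a DynamicFrame and write it through Glue's
// S3 sink (the path the question reports as working for the cross-account write).
def writeWithGlue(glueContext: GlueContext, dataFrame: DataFrame, location: String, fileFormat: String): Unit = {
  val dynamicFrame = DynamicFrame(dataFrame, glueContext)
  glueContext
    .getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions(Map("path" -> location)),
      format = fileFormat)
    .writeDynamicFrame(dynamicFrame)
}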

You are probably using the S3AFileSystem implementation for the "s3" scheme (i.e. URLs of the form "s3://..."). You can check that by looking at sparkContext.hadoopConfiguration.get("fs.s3.impl"). If that is the case, then you actually need to set the Hadoop properties for "fs.s3a.*", not "fs.s3.*".
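
For instance, a quick way to confirm this from inside the job (a sketch; the class name in the comment is just the value you would expect to see for S3A):

// If this prints org.apache.hadoop.fs.s3a.S3AFileSystem (or is unset and resolves to it),
// then only the fs.s3a.* properties are honoured and the fs.s3.* ones are ignored.
println(sparkContext.hadoopConfiguration.get("fs.s3.impl"))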

Then the correct settings would be:

sparkContext.hadoopConfiguration.set("fs.s3a.canned.acl", "BucketOwnerFullControl")
sparkContext.hadoopConfiguration.set("fs.s3a.acl.default", "BucketOwnerFullControl")

The S3AFileSystem implementation is not using any of the properties under "fs.s3". You can see that by investigating the code at the following Hadoop source code link: https://github.com/apache/hadoop/blob/43e8ac60971323054753bb0b21e52581f7996ece/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java#L268
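
Putting this together with the question's helper, the ACL part might be set roughly like the sketch below before the write. It only covers the canned-ACL settings quoted above; the fs.s3.cse.* client-side-encryption properties in the question come from the EMRFS client and have different S3A counterparts, which are not shown here.

import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame

// Sketch: grant the destination bucket owner full control via the S3A canned ACL,
// then write the DataFrame the same way as in the question.
def writeWithBucketOwnerAcl(sparkContext: SparkContext, dataFrame: DataFrame, location: String, fileFormat: String, saveMode: String): Unit = {
  sparkContext.hadoopConfiguration.set("fs.s3a.canned.acl", "BucketOwnerFullControl")
  sparkContext.hadoopConfiguration.set("fs.s3a.acl.default", "BucketOwnerFullControl")

  dataFrame
    .repartition(5)
    .write
    .mode(saveMode)
    .option("header", "true")
    .format(fileFormat)
    .save(location)
}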
