
Read from s3 via spark from outside AWS using temp credentials

I'm trying to read a file from s3 on my laptop via IntelliJ, so that I can develop my spark job more easily.

The textFile RDD code works in Zeppelin within an EMR cluster, but not when I try it locally.

In Zeppelin I didn't need to do any spark context setup, presumably because it's done for me since the Zeppelin instance is inside the AWS environment.

I've written code to create temporary AWS credentials (using my IAM user keys) so that I can provide a session token to the spark context. The access key and secret key also come from the temp credentials.
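
For reference, the temp credentials come from an STS AssumeRole call, roughly like the simplified sketch below (AWS SDK for Java v1; the session name is just a placeholder):

import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder
import com.amazonaws.services.securitytoken.model.AssumeRoleRequest

// STS client built from my IAM user keys (picked up from the default credential chain)
val sts = AWSSecurityTokenServiceClientBuilder.standard().build()

// Assume the role and extract the temporary credentials
val creds = sts.assumeRole(
  new AssumeRoleRequest()
    .withRoleArn("arn:aws:iam::1234:role/someRoleThatWasUsedInTheWorkingTempCredCode")
    .withRoleSessionName("local-spark-dev")
).getCredentials

val accessKeyId     = creds.getAccessKeyId
val secretAccessKey = creds.getSecretAccessKey
val sessionToken    = creds.getSessionToken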

val sqlContext = sparkSession.sqlContext
val hadoopConf = sqlContext.sparkContext.hadoopConfiguration

hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

hadoopConf.set("fs.s3a.awsAccessKeyId", accessKeyId)
hadoopConf.set("fs.s3a.access.key", accessKeyId)
hadoopConf.set("fs.s3a.awsSecretAccessKey", secretAccessKey)
hadoopConf.set("fs.s3a.secret.key", secretAccessKey)
hadoopConf.set("fs.s3a.session.token", sessionToken)
hadoopConf.set("fs.s3a.credentialsType", "AssumeRole")
hadoopConf.set(
  "fs.s3a.stsAssumeRole.arn",
  "arn:aws:iam::1234:role/someRoleThatWasUsedInTheWorkingTempCredCode"
)

hadoopConf.set(
  "fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"
)

sqlContext.sparkContext.textFile(
  "s3a://path/to/file/that/definitely/exists/3714bb50a146.gz"
).collect()

I was expecting an array with the data from the file; instead I get a permission denied error.

org.apache.hadoop.security.AccessControlException: Permission denied: s3n://path/to/file/that/definitely/exists/3714bb50a146.gz
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:449)
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)

Two questions:

1) Is what I'm doing even possible (running a spark job locally that reads from s3)?

2) If it is possible, is my spark context setup code valid? I feel like I'm missing a property or using the wrong property key.

Get rid of that line about fs.s3a.impl. All it does is change the default mapping of "s3a" from "the modern, supported, maintained S3A connector" to "the old, obsolete, unsupported S3N connector".

You do not need that line. The fact that people writing spark apps always add it is just superstition. hadoop-common knows which filesystem class handles s3a URLs the same way it knows who handles "file" and "hdfs".
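
With that line gone, the S3A connector only needs the session-credential provider and the three values from STS, something like this (property names as used by Hadoop 2.8+; exact support depends on the hadoop-aws version on your classpath):

val hadoopConf = sparkSession.sparkContext.hadoopConfiguration

// Use the credential provider that understands session tokens
hadoopConf.set(
  "fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"
)

// The s3a property names (the awsAccessKeyId/awsSecretAccessKey spellings are the old s3n-style names)
hadoopConf.set("fs.s3a.access.key", accessKeyId)
hadoopConf.set("fs.s3a.secret.key", secretAccessKey)
hadoopConf.set("fs.s3a.session.token", sessionToken)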
