
Read from s3 via spark from outside AWS using temp credentials

I'm trying to read a file from S3 from my laptop via IntelliJ, so that I can develop my Spark job more easily.

The textFile RDD code works in Zeppelin on an EMR cluster, but not when I run it locally.

In Zeppelin I didn't need to do any Spark context setup, presumably because Zeppelin does it for me, since the instance runs inside the AWS environment.

I've written code to create temporary AWS credentials (using my IAM user keys) so that I can provide a session token to the Spark context. The access key and secret key below also come from those temporary credentials.
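For reference, the temp-credential generation looks roughly like this (a sketch using the AWS SDK for Java v1 STS client, not my exact code; the session name and duration are placeholders, and the role ARN is the one referenced in the config below):

// Sketch: mint temporary credentials by assuming a role via STS.
// The base credentials (my IAM user keys) are picked up from the default provider chain.
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder
import com.amazonaws.services.securitytoken.model.AssumeRoleRequest

val sts = AWSSecurityTokenServiceClientBuilder.standard().build()

val assumeRoleResult = sts.assumeRole(
  new AssumeRoleRequest()
    .withRoleArn("arn:aws:iam::1234:role/someRoleThatWasUsedInTheWorkingTempCredCode")
    .withRoleSessionName("local-spark-dev") // placeholder session name
    .withDurationSeconds(3600)              // placeholder duration
)

val creds           = assumeRoleResult.getCredentials
val accessKeyId     = creds.getAccessKeyId
val secretAccessKey = creds.getSecretAccessKey
val sessionToken    = creds.getSessionToken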

val sqlContext = sparkSession.sqlContext

sqlContext.sparkContext.hadoopConfiguration
  .set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

sqlContext.sparkContext.hadoopConfiguration
  .set("fs.s3a.awsAccessKeyId", accessKeyId)
sqlContext.sparkContext.hadoopConfiguration
  .set("fs.s3a.access.key", accessKeyId)
sqlContext.sparkContext.hadoopConfiguration
  .set("fs.s3a.awsSecretAccessKey", secretAccessKey)
sqlContext.sparkContext.hadoopConfiguration
  .set("fs.s3a.secret.key", secretAccessKey)
sqlContext.sparkContext.hadoopConfiguration
  .set("fs.s3a.session.token", sessionToken)
sqlContext.sparkContext.hadoopConfiguration
  .set("fs.s3a.credentialsType", "AssumeRole")
sqlContext.sparkContext.hadoopConfiguration
  .set(
    "fs.s3a.stsAssumeRole.arn",
    "arn:aws:iam::1234:role/someRoleThatWasUsedInTheWorkingTempCredCode"
  )

sqlContext.sparkContext.hadoopConfiguration.set(
  "fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"
)

sqlContext.sparkContext.textFile(
  "s3a://path/to/file/that/definitely/exists/3714bb50a146.gz"
).collect()

I was expecting an array with data from the file; instead I get a permission-denied error.

org.apache.hadoop.security.AccessControlException: Permission denied: s3n://path/to/file/that/definitely/exists/3714bb50a146.gz
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:449)
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)

Two questions:

1) Is what I'm doing possible (executing a spark job that reads from s3 locally)?

2) If what I am doing is possible, is my Spark context setup code valid? I feel like I'm missing a property or using the wrong property key.

Get rid of that line about fs.s3a.impl. All it does is change the default mapping of "s3a" from "the modern, supported, maintained S3A connector" to the "old, obsolete, unsupported S3N connector".

You do not need that line. The fact that people writing Spark apps always add it is just superstition: hadoop-common knows which filesystem class handles s3a URLs, the same way it knows which class handles "file" and "hdfs".
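Applied to the setup in the question, a minimal sketch with the fs.s3a.impl line removed and only the documented S3A temporary-credential properties kept (this assumes accessKeyId, secretAccessKey and sessionToken hold the STS temp credentials, and that a hadoop-aws artifact matching your Hadoop version is on the classpath; keys such as fs.s3a.awsAccessKeyId and fs.s3a.credentialsType are not standard S3A property names, so they are dropped here):

val conf = sqlContext.sparkContext.hadoopConfiguration

// No fs.s3a.impl needed: hadoop-common already maps the s3a:// scheme to the S3A connector.
conf.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
conf.set("fs.s3a.access.key", accessKeyId)
conf.set("fs.s3a.secret.key", secretAccessKey)
conf.set("fs.s3a.session.token", sessionToken)

sqlContext.sparkContext.textFile(
  "s3a://path/to/file/that/definitely/exists/3714bb50a146.gz"
).collect()

Note that the stack trace in the question mentions s3n:// and Jets3t even though the URL is s3a://, which is what you would expect once fs.s3a.impl has re-routed s3a URLs to the old NativeS3FileSystem.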
