
Unable to read from s3 bucket using spark

import org.apache.spark.sql.SparkSession

val spark = SparkSession
        .builder()
        .appName("try1")
        .master("local")
        .getOrCreate()

import spark.implicits._ // needed for the $"colName" column syntax

val df = spark.read
        .json("s3n://BUCKET-NAME/FOLDER/FILE.json")

df.select($"uid").show(5)

I have set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables. I get the following error when trying to read from S3:

Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/FOLDER%2FFILE.json' - ResponseCode=400, ResponseMessage=Bad Request

I suspect the error is caused by "/" being converted to "%2F" by some internal function, since the error shows '/FOLDER%2FFILE.json' instead of '/FOLDER/FILE.json'.

Your Spark (JVM) application cannot read environment variables unless you tell it to, so a quick workaround is to pass the credentials into the Hadoop configuration explicitly:

spark.sparkContext
     .hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
spark.sparkContext
     .hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
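
If you are on a newer Hadoop version where the s3n connector is deprecated, the equivalent settings for the s3a connector would look like the sketch below (assuming the hadoop-aws module is on the classpath; fs.s3a.access.key and fs.s3a.secret.key are the standard S3A property names):

// Sketch: the same workaround for the s3a connector, reading the keys
// from the environment variables mentioned in the question
val accessKey = sys.env("AWS_ACCESS_KEY_ID")
val secretKey = sys.env("AWS_SECRET_ACCESS_KEY")

spark.sparkContext
     .hadoopConfiguration.set("fs.s3a.access.key", accessKey)
spark.sparkContext
     .hadoopConfiguration.set("fs.s3a.secret.key", secretKey)

// With s3a configured, the read path uses the s3a:// scheme:
val df = spark.read.json("s3a://BUCKET-NAME/FOLDER/FILE.json")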

You'll also need to specify the S3 endpoint:

// note: fs.s3a.* properties apply to the s3a:// connector,
// so the read path should use the s3a:// scheme rather than s3n://
spark.sparkContext
     .hadoopConfiguration.set("fs.s3a.endpoint", "<<ENDPOINT>>")
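
For example (the region endpoint below is hypothetical; substitute the endpoint for your bucket's region):

// Hypothetical usage example: a bucket hosted in the eu-west-1 region
spark.sparkContext
     .hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com")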

To learn more about AWS S3 endpoints, refer to the AWS "Regions and Endpoints" documentation.
