I want to access S3 from Spark without configuring any secret and access keys; I want to authenticate via an IAM role instead, so I followed the steps given in s3-spark.
But it is still not working from my EC2 instance (which is running standalone Spark).
It works when I test with the AWS CLI:
[ec2-user@ip-172-31-17-146 bin]$ aws s3 ls s3://testmys3/
2019-01-16 17:32:38 130 e.json
but it did not work when I tried the following:
scala> val df = spark.read.json("s3a://testmys3/*")
I get the error below:
19/01/16 18:23:06 WARN FileStreamSink: Error while looking for metadata directory.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: E295957C21AFAC37, AWS Error Code: null, AWS Error Message: Bad Request
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:616)
This config worked:
./spark-shell \
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
--conf spark.hadoop.fs.s3a.endpoint=s3.us-east-2.amazonaws.com \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
--conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
--conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
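With the shell up, a quick way to confirm the role-based credentials are working is to read back the object listed by "aws s3 ls" above (a minimal sketch using the bucket and file from the question):

scala> val df = spark.read.json("s3a://testmys3/e.json")
scala> df.show()  // should print the contents of e.json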
"400 Bad Request" is fairly unhelpful, and not only does S3 not provide much, the S3A connector doesn't date print much related to auth either. There's a big section on troubleshooting the error
The fact it got as far as making a request means that it has some credentials, only the far end doesn't like them
Possibilities
You should automatically have IAM auth as an authentication mechanism with the S3A connector; it's the one checked last, after configuration options and environment variables.
fs.s3a.aws.credentials.provider must be unset or contain the option com.amazonaws.auth.InstanceProfileCredentialsProvider (see the sketch after this list).
On the hadoop command line, grab storediag and run:
hadoop jar cloudstore-0.1-SNAPSHOT.jar storediag s3a://testmys3/
It should dump what it is up to regarding authentication.
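If you would rather pin the provider explicitly than rely on the default lookup order, it can also be set programmatically. A minimal sketch, assuming Spark 2.x with hadoop-aws on the classpath (the app name is just illustrative):

import org.apache.spark.sql.SparkSession

// Force S3A to use the EC2 instance-profile (IAM role) credentials.
val spark = SparkSession.builder()
  .appName("s3a-iam-role-example")  // hypothetical name
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.InstanceProfileCredentialsProvider")
  .getOrCreate()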
Update
As the original poster commented, the problem was that v4 authentication is required by that specific S3 endpoint. This can be enabled on the 2.7.x version of the S3A client, but only via Java system properties. For 2.8+ there are fs.s3a. options you can set instead.
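For example, on 2.8+ the endpoint and signing algorithm can be expressed as plain S3A options instead of JVM flags; a sketch, assuming the us-east-2 endpoint from the question (fs.s3a.signing-algorithm is my recollection of the option name, so check the hadoop-aws docs for your version):

fs.s3a.endpoint = s3.us-east-2.amazonaws.com
fs.s3a.signing-algorithm = AWSS3V4SignerType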
This works for me:

fs.s3a.aws.credentials.provider = com.cloudera.com.amazonaws.auth.InstanceProfileCredentialsProvider
fs.s3a.endpoint = s3-ap-northeast-2.amazonaws.com
fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem

val rdd = sc.textFile("s3a://path/file")
rdd.count()
rdd.take(10).foreach(println)