
Spark and Amazon S3 not setting credentials in executors

I'm writing a Spark program that reads from and writes to Amazon S3. My problem is that it works if I execute it in local mode (--master local[6]), but if I execute it on the cluster (on other machines) I get an error with the credentials:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 33, mmdev02.stratio.com): com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:384)
at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:155)
at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Caused by: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain

My code is as follows:

    import com.amazonaws.SDKGlobalConfiguration
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("BackupS3")
    val sc = SparkContext.getOrCreate(conf)

    sc.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
    sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-" + region + ".amazonaws.com")
    sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.buffer.dir", "/var/tmp/spark")

    System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")
    System.setProperty("com.amazonaws.services.s3.enableV4", "true")

I can write to Amazon S3 but cannot read! I also had to pass some properties when doing spark-submit, because my region is Frankfurt and I had to enable V4 signing:

--conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
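
For completeness, a minimal sketch (not from the original post) of setting the same executor JVM option from code on the SparkConf before the SparkContext is created; the driver-side system property still has to be set separately, as in the code above:

    // Sketch: pass the V4-signing option to executor JVMs via SparkConf
    // instead of on the spark-submit command line.
    val conf = new SparkConf()
      .setAppName("BackupS3")
      .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")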

I tried passing the credentials this way (as spark-submit properties) too. If I put them in the hdfs-site.xml on every machine, it works.

My question is: how can I do it from code? Why are the executors not getting the configuration I pass them from the code?

I'm using Spark 1.5.2, hadoop-aws 2.7.1 and aws-java-sdk 1.7.4.

Thanks

  • Don't put secrets in the keys; that leads to loss of secrets.
  • If you are running in EC2, your secrets will be picked up automatically from the IAM feature; the client asks a magic web server for session secrets.
  • ...which means: it may be that Spark's automatic credential propagation is getting in the way. Unset your AWS_ env vars before submitting the work.

If you set these properties explicitly in your code, the values will only be visible to the driver process. The executors will not have a chance to pick up those credentials.

If you set them in an actual config file like core-site.xml, they will propagate.
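
For illustration, a minimal sketch (not part of the original answer) of the usual code-side alternative: Spark copies properties prefixed with spark.hadoop. into the Hadoop Configuration it builds, so they travel with the job rather than living only in the driver's in-memory sc.hadoopConfiguration. Here accessKeyId, secretKey and region are the placeholders from the question:

    import org.apache.spark.{SparkConf, SparkContext}

    // Properties prefixed with "spark.hadoop." are copied by Spark into the
    // Hadoop Configuration it constructs for the job.
    val conf = new SparkConf()
      .setAppName("BackupS3")
      .set("spark.hadoop.fs.s3a.access.key", accessKeyId)
      .set("spark.hadoop.fs.s3a.secret.key", secretKey)
      .set("spark.hadoop.fs.s3a.endpoint", "s3-" + region + ".amazonaws.com")

    val sc = SparkContext.getOrCreate(conf)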

Your code works in local mode because all operations are happening in a single process.

Why it works on a cluster with small files but not large ones (*): the code could also work on unpartitioned files, where read operations are performed in the driver and partitions are then broadcast to executors. On partitioned files, where executors read individual partitions, the credentials won't be set on the executors, so it fails.

Best to use standard mechanisms for passing credentials, or better yet, use EC2 roles and IAM policies in your cluster, as EricJ's answer suggests. By default, if you do not provide credentials, EMRFS will look up temporary credentials via the EC2 instance metadata service.
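
As a quick sanity check (a sketch, not part of the original answer) that an attached EC2 instance role is actually visible to the JVM, the AWS SDK's instance-profile provider can be queried directly; it fetches temporary credentials from the instance metadata service:

    import com.amazonaws.auth.InstanceProfileCredentialsProvider

    // On an EC2 instance with an IAM role attached, this should return
    // temporary credentials obtained from the instance metadata service.
    val creds = new InstanceProfileCredentialsProvider().getCredentials()
    println("Role credentials found for access key: " + creds.getAWSAccessKeyId)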

(*) I am still learning about this myself, and I may need to revise this answer as I learn more.
