
Access a publicly available Amazon S3 file from Apache Spark

I have a publicly available Amazon S3 resource (a text file) and want to access it from Spark. That means I don't have any Amazon credentials. It works fine if I just want to download it:

import com.amazonaws.services.s3.AmazonS3Client

val bucket = "<my-bucket>"
val key = "<my-key>"

val client = new AmazonS3Client
val o = client.getObject(bucket, key)
val content = o.getObjectContent // <= can be read and used as an input stream

However, when I try to access the same resource from a Spark context:

val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
val f = sc.textFile(s"s3a://$bucket/$key")
println(f.count())

I receive the following error with stack trace:

Exception in thread "main" com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
    at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
    at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
    at com.example.Main$.main(Main.scala:14)
    at com.example.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)

I don't want to provide any AWS credentials - I just want to access the resource anonymously (for now). How can I achieve this? I probably need to make it use something like AnonymousAWSCredentialsProvider, but how do I plug that into Spark or Hadoop?

PS: My build.sbt, just in case:

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.1",
  "org.apache.hadoop" % "hadoop-aws" % "2.7.1"
)

UPDATED: After some investigation, I can see why it isn't working.

First of all, S3AFileSystem creates the AWS client with the following order of credential providers:

AWSCredentialsProviderChain credentials = new AWSCredentialsProviderChain(
    new BasicAWSCredentialsProvider(accessKey, secretKey),
    new InstanceProfileCredentialsProvider(),
    new AnonymousAWSCredentialsProvider()
);

"accessKey" and "secretKey" values are taken from the spark conf instance (keys must be "fs.s3a.access.key" and "fs.s3a.secret.key" or org.apache.hadoop.fs.s3a.Constants.ACCESS_KEY and org.apache.hadoop.fs.s3a.Constants.SECRET_KEY constants, which is more convenient). “accessKey”和“secretKey”值取自spark conf实例(键必须为“fs.s3a.access.key”和“fs.s3a.secret.key”或org.apache.hadoop.fs.s3a.Constants .ACCESS_KEY和org.apache.hadoop.fs.s3a.Constants.SECRET_KEY常量,这样更方便)。

Second, you can see that AnonymousAWSCredentialsProvider is the third option (lowest priority). What could possibly be wrong with that? See the implementation of AnonymousAWSCredentials:

public class AnonymousAWSCredentials implements AWSCredentials {

    public String getAWSAccessKeyId() {
        return null;
    }

    public String getAWSSecretKey() {
        return null;
    }
}

It simply returns null for both the access key and the secret key. Sounds reasonable. But look inside AWSCredentialsProviderChain:

AWSCredentials credentials = provider.getCredentials();

if (credentials.getAWSAccessKeyId() != null &&
    credentials.getAWSSecretKey() != null) {
    log.debug("Loading credentials from " + provider.toString());

    lastUsedProvider = provider;
    return credentials;
}

It doesn't choose a provider when both keys are null, which means anonymous credentials can't work. This looks like a bug in aws-java-sdk-1.7.4. I tried to use the latest version, but it's incompatible with hadoop-aws-2.7.1.
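
Just to illustrate that check (a standalone sketch, not code from Spark or Hadoop), feeding the chain only a provider that returns AnonymousAWSCredentials ends with the same exception, because both keys come back null:

import com.amazonaws.AmazonClientException
import com.amazonaws.auth.{AWSCredentials, AWSCredentialsProvider, AWSCredentialsProviderChain, AnonymousAWSCredentials}

// a provider that always yields anonymous (null/null) credentials
val anonymousOnly = new AWSCredentialsProvider {
  override def getCredentials: AWSCredentials = new AnonymousAWSCredentials
  override def refresh(): Unit = ()
}

val chain = new AWSCredentialsProviderChain(anonymousOnly)
try {
  chain.getCredentials() // the chain skips the provider because both keys are null
} catch {
  case e: AmazonClientException =>
    println(e.getMessage) // "Unable to load AWS credentials from any provider in the chain"
}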

Any other ideas?

I have personally never accessed public data from Spark. You can try to use dummy credentials, or create some just for this purpose. Set them directly on the SparkConf object.

val sparkConf: SparkConf = ???
val accessKeyId: String = ???
val secretAccessKey: String = ???
sparkConf.set("spark.hadoop.fs.s3.awsAccessKeyId", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3n.awsAccessKeyId", accessKeyId)
sparkConf.set("spark.hadoop.fs.s3.awsSecretAccessKey", secretAccessKey)
sparkConf.set("spark.hadoop.fs.s3n.awsSecretAccessKey", secretAccessKey)

As an alternative, read the documentation of DefaultAWSCredentialsProviderChain to see where the credentials are looked for. The list (order is important) is below, with a short sketch of the system-property option after it:

  • Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
  • Java System Properties - aws.accessKeyId and aws.secretKey
  • Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
  • Instance profile credentials delivered through the Amazon EC2 metadata service
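
As promised above, a minimal sketch of the system-property option (the second item); the property names come from the AWS SDK documentation and the values are hypothetical placeholders:

// set these before any AWS client is created, so that
// DefaultAWSCredentialsProviderChain can pick them up
System.setProperty("aws.accessKeyId", "<access-key>")
System.setProperty("aws.secretKey", "<secret-key>")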

It seems you can now use the fs.s3a.aws.credentials.provider configuration key to get anonymous access via org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider, which correctly special-cases the anonymous provider. However, you need a newer hadoop-aws than 2.7, which means you also need a Spark installation without a bundled Hadoop.

Here is how I did it on Colab:

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
!tar xf spark-2.3.1-bin-without-hadoop.tgz
!pip install -q findspark
!pip install -q pyarrow

Now we install Hadoop on the side and set the output of hadoop classpath into SPARK_DIST_CLASSPATH, so Spark can see it.

import os
!wget -q http://mirror.nbtelecom.com.br/apache/hadoop/common/hadoop-2.8.4/hadoop-2.8.4.tar.gz
!tar xf hadoop-2.8.4.tar.gz
os.environ['HADOOP_HOME']= '/content/hadoop-2.8.4'
os.environ["SPARK_DIST_CLASSPATH"] = "/content/hadoop-2.8.4/etc/hadoop:/content/hadoop-2.8.4/share/hadoop/common/lib/*:/content/hadoop-2.8.4/share/hadoop/common/*:/content/hadoop-2.8.4/share/hadoop/hdfs:/content/hadoop-2.8.4/share/hadoop/hdfs/lib/*:/content/hadoop-2.8.4/share/hadoop/hdfs/*:/content/hadoop-2.8.4/share/hadoop/yarn/lib/*:/content/hadoop-2.8.4/share/hadoop/yarn/*:/content/hadoop-2.8.4/share/hadoop/mapreduce/lib/*:/content/hadoop-2.8.4/share/hadoop/mapreduce/*:/content/hadoop-2.8.4/contrib/capacity-scheduler/*.jar"

Then we proceed as in https://mikestaszel.com/2018/03/07/apache-spark-on-google-colaboratory/, but add s3a and anonymous-read support, which is what this question is about.

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-without-hadoop"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.6,org.apache.hadoop:hadoop-aws:2.8.4 --conf spark.sql.execution.arrow.enabled=true --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider pyspark-shell'

And finally we can create the session.

import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

This is what helped me:

val session = SparkSession.builder()
  .appName("App")
  .master("local[*]") 
  .config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
  .getOrCreate()

val df = session.read.csv(filesFromS3:_*)

Versions:

"org.apache.spark" %% "spark-sql" % "2.4.0",
"org.apache.hadoop" % "hadoop-aws" % "2.8.5",

Documentation: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Authentication_properties
