
Can't read csv from S3 to pyspark dataframe on an EC2 instance on AWS

I can't read a csv file from S3 into a pyspark dataframe on an EC2 instance on AWS. I have created a Spark cluster on AWS using Flintrock. Here is my Flintrock configuration file (on my local machine):

services:
  spark:
    version: 3.0.0
  hdfs:
    version: 2.7.3

provider: ec2

providers:
  ec2:
    key-name: xxxx
    identity-file: /home/yyyy/keys/xxxx.pem
    instance-type: t2.micro
    region: us-east-1
    ami: ami-02354e95b39ca8dec
    user: ec2-user

launch:
  num-slaves: 1
  install-hdfs: False

Then I start the cluster on AWS as follows:

flintrock launch mysparkcluster

The cluster gets created and seems to work. Then I install python3 as follows:

flintrock run-command mysparkcluster 'sudo yum install -y python3'

Then I log in to the master node:

flintrock login mysparkcluster

Then I do:

export PYSPARK_PYTHON=/usr/bin/python3

Then I start the pyspark shell (so far everything works):

pyspark --master spark://0.0.0.0:7077 --packages org.apache.hadoop:hadoop-aws:2.7.4

In the pyspark shell I then set the required credentials. Since I am using an AWS Educate account, my understanding is that I only ever get temporary sessions, for which I need a session token in addition to the access key ID and secret key:

from pyspark.sql import SQLContext
sqlc = SQLContext(sc)

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "KEYXYZ")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "SECRETXYZ")
spark._jsc.hadoopConfiguration().set("fs.s3a.session.token", "VERYLONGTOKEN")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")

And then I try to read in the csv file as follows:

df = sqlc.read.csv('s3a://mybucket/myfile.csv', header='true', inferSchema='true')

I am getting the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ec2-user/spark/python/pyspark/sql/readwriter.py", line 535, in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/ec2-user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/home/ec2-user/spark/python/pyspark/sql/utils.py", line 131, in deco
    return f(*a, **kw)
  File "/home/ec2-user/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o51.csv.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: EEAD03F2F4012750, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: mi9O78oh2QbtklTCrCQkv6SuPFR0UR6zl5CB4kuHTCJD7mdNrA6s5R8oejWJ0MAlAS8zOPJY7FY=
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
    at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1439)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

What am I doing wrong?

Thank you for your tips in advance!

Probably something was wrong with the way I supplied my credentials via hadoopConfiguration().set() in the Python code. However, there is another way of configuring Flintrock (and, more generally, EC2 instances) to access S3 without supplying any credentials in the code; this is actually the recommended approach when dealing with temporary credentials from AWS. The following helped (see the sketch after this list):

  • The Flintrock documentation, which says: "Setup an IAM Role that grants access to S3 as desired. Reference this role when you launch your cluster using the --ec2-instance-profile-name option (or its equivalent in your config.yaml file)."
  • This AWS documentation page, which explains step by step how to do it.
  • Another useful AWS documentation page.
  • Please note: if you create the above role via the AWS Console, the corresponding instance profile with the same name is created automatically; otherwise (if you use the awscli or the AWS API) you have to create the desired instance profile manually as an extra step.
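As an illustration, here is a minimal sketch of that setup using the awscli, assuming a hypothetical role and instance profile both named sparkS3AccessRole and the AWS-managed AmazonS3ReadOnlyAccess policy (attach a narrower policy if you only need specific buckets):

# trust-policy.json lets EC2 instances assume the role
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the role and grant it S3 read access
# (sparkS3AccessRole is a hypothetical name, pick your own)
aws iam create-role --role-name sparkS3AccessRole \
    --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy --role-name sparkS3AccessRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Outside the AWS Console the instance profile is an explicit extra step
aws iam create-instance-profile --instance-profile-name sparkS3AccessRole
aws iam add-role-to-instance-profile --instance-profile-name sparkS3AccessRole \
    --role-name sparkS3AccessRole

# Reference the instance profile when launching the cluster
flintrock launch mysparkcluster --ec2-instance-profile-name sparkS3AccessRole

With the instance profile attached to the cluster nodes, the explicit fs.s3a.access.key, fs.s3a.secret.key and fs.s3a.session.token settings should no longer be needed: the default S3A credential chain can pick up the instance credentials from the EC2 metadata service, so the read reduces to sqlc.read.csv('s3a://mybucket/myfile.csv', header='true', inferSchema='true').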
