Can't read CSV from S3 into a PySpark DataFrame on an EC2 instance on AWS
I can't read a CSV file from S3 into a PySpark DataFrame on an EC2 instance in the AWS cloud. I have created a Spark cluster on AWS using Flintrock. Here is my Flintrock configuration file (on my local machine):
services:
  spark:
    version: 3.0.0
  hdfs:
    version: 2.7.3

provider: ec2

providers:
  ec2:
    key-name: xxxx
    identity-file: /home/yyyy/keys/xxxx.pem
    instance-type: t2.micro
    region: us-east-1
    ami: ami-02354e95b39ca8dec
    user: ec2-user

launch:
  num-slaves: 1
  install-hdfs: False
Then I start the cluster on AWS as follows:
flintrock launch mysparkcluster
The cluster gets created and seems to work. Then I install python3 as follows:
flintrock run-command mysparkcluster 'sudo yum install -y python3'
Then I log in to the master node:
flintrock login mysparkcluster
Then I do:
export PYSPARK_PYTHON=/usr/bin/python3
Then I start the pyspark shell (so far, everything works):
pyspark --master spark://0.0.0.0:7077 --packages org.apache.hadoop:hadoop-aws:2.7.4
In the pyspark shell I then set the required credentials. Since I am using an AWS Educate account, my understanding is that I only ever get temporary sessions, for which I need a session token in addition to the access key ID and secret key:
from pyspark.sql import SQLContext
sqlc = SQLContext(sc)
# Temporary credentials from the AWS Educate session: key, secret, and session token
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "KEYXYZ")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "SECRETXYZ")
spark._jsc.hadoopConfiguration().set("fs.s3a.session.token", "VERYLONGTOKEN")
# Credentials provider that is supposed to handle session tokens, plus the S3A filesystem and endpoint
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")
And then I try to read the CSV file as follows:
df = sqlc.read.csv('s3a://mybucket/myfile.csv', header='true', inferSchema='true')
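(As an aside, SQLContext is deprecated in Spark 3.x; the pyspark shell already provides a SparkSession as spark, so an equivalent read would be:)

# Equivalent read via the SparkSession the pyspark shell provides as `spark`
df = spark.read.csv('s3a://mybucket/myfile.csv', header=True, inferSchema=True)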
I am getting the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ec2-user/spark/python/pyspark/sql/readwriter.py", line 535, in csv
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/home/ec2-user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/home/ec2-user/spark/python/pyspark/sql/utils.py", line 131, in deco
return f(*a, **kw)
File "/home/ec2-user/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o51.csv.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: EEAD03F2F4012750, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: mi9O78oh2QbtklTCrCQkv6SuPFR0UR6zl5CB4kuHTCJD7mdNrA6s5R8oejWJ0MAlAS8zOPJY7FY=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1439)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
What am I doing wrong?
Thank you for your tips in advance!
Probably something about the way I supplied my credentials via hadoopConfiguration().set() in the Python code was wrong. But there is another way of configuring Flintrock (and, more generally, EC2 instances) to access S3 without supplying credentials in the code; this is actually a recommended way of doing it when dealing with temporary credentials from AWS. The following helped:
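A sketch of what that configuration can look like; it assumes Flintrock's instance-profile-name option, and MySparkS3Role is a hypothetical IAM role/instance profile that grants read access to the bucket:

# config.yaml (sketch): attach an IAM instance profile so the EC2 nodes
# obtain S3 credentials from instance metadata instead of hard-coded keys.
providers:
  ec2:
    key-name: xxxx
    identity-file: /home/yyyy/keys/xxxx.pem
    instance-type: t2.micro
    region: us-east-1
    ami: ami-02354e95b39ca8dec
    user: ec2-user
    instance-profile-name: MySparkS3Role   # hypothetical role with S3 read access

With the role attached to the instances, S3A's default credential provider chain should pick up the instance-profile credentials automatically, so the CSV read above works without any hadoopConfiguration().set() calls in the code.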