How to read and write from/to S3 using Spark 3.0.0?

I'm trying to launch a Spark application that should be able to read from and write to S3, using the Spark Operator on Kubernetes and pySpark version 3.0.0. The Spark Operator is working nicely, but I soon realized that the launched application can't read files from S3 properly.

This command:

spark.read.json("s3a://bucket/path/to/data.json")

is throwing this exception:

py4j.protocol.Py4JJavaError: An error occurred while calling o58.json.
java.lang.RuntimeException: java.lang.ClassNotFoundException: 
Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

I first tried this using gcr.io/spark-operator/spark-py:v3.0.0 as the Spark image, and then tried adding some jars to it, with no success:

ADD https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.9/httpclient-4.5.9.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar $SPARK_HOME/jars
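
For reference: if the Spark 3.0.0 image ships Hadoop 3.x, then hadoop-aws 2.7.4 and aws-java-sdk 1.7.4 likely won't match the Hadoop libraries already on the classpath, which is a common way to end up with this kind of ClassNotFoundException. A minimal alternative sketch, assuming the image was built against Hadoop 3.2 and that Ivy can reach Maven Central from the cluster (both assumptions, not confirmed for this image):

    from pyspark.sql import SparkSession

    # Sketch: let Spark resolve a matching S3A connector at startup instead
    # of baking jars into the image. The hadoop-aws version is an assumption
    # and must match the Hadoop build of the image; aws-java-sdk-bundle is
    # pulled in transitively.
    spark = (
        SparkSession.builder
        .appName("s3a-packages-sketch")
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
        .getOrCreate()
    )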

Here's my Spark conf:

    "spark.hadoop.fs.s3a.endpoint": "S3A_ENDPOINT"
    "spark.hadoop.fs.s3a.access.key": "ACCESS_KEY"
    "spark.hadoop.fs.s3a.secret.key": "SECRET_KEY"
    "spark.hadoop.fs.s3a.connection.ssl.enabled": "false"
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.driver.extraClassPath": "/opt/spark/jars/*"
    "spark.executor.extraClassPath": "/opt/spark/jars/*"

And my $SPARK_HOME is /opt/spark.

Is anyone able to read/write from S3 using Spark 3.0.0? Is this an issue with pyspark exclusively? How can I "fix" this? Thanks in advance!

I figured out how to do it. Here's a fork with the changes I made to the base Docker image (just a few changes):

https://github.com/Coqueiro/spark/tree/branch-3.0-s3

I created a Makefile to aid distribution creation, but I basically just followed the official doc:

http://spark.apache.org/docs/latest/building-spark.html

Also, here's the image, already built and pushed to Docker Hub: https://hub.docker.com/repository/docker/coqueirotree/spark-py

It has Spark 3.0.0, Hadoop 3.2.0, S3A and Kubernetes support.
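
As a quick sanity check once the image is running, a round-trip sketch (the bucket and path below are placeholders for a bucket the job can actually write to):

    # Write a tiny DataFrame to S3 and read it back.
    df = spark.range(5)
    df.write.mode("overwrite").json("s3a://bucket/tmp/s3a-smoke-test")
    spark.read.json("s3a://bucket/tmp/s3a-smoke-test").show()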

Have you tried using the Spark jars that come with pre-built Hadoop libraries (https://spark.apache.org/downloads.html)? You can also add the Hadoop dependencies to your classpath.
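
To pick a hadoop-aws jar that matches your classpath, it can help to print the Hadoop version your Spark build actually ships. A sketch using pySpark's JVM gateway (_jvm is an internal handle, so treat this as a debugging aid only):

    # Print the Hadoop version bundled with this Spark build, so the
    # hadoop-aws jar you add can be chosen to match it exactly.
    print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())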
