
Pyspark S3 error: java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException

I have been unsuccessful in setting up a Spark cluster that can read AWS S3 files. The software I used is as follows:

  1. hadoop-aws-3.2.0.jar
  2. aws-java-sdk-1.11.887.jar
  3. spark-3.0.1-bin-hadoop3.2.tgz

Using Python version: Python 3.8.6

from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
import sys

spark = (SparkSession.builder
         .appName("AuthorsAges")
         .getOrCreate())


spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "access-key")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "secret-key")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "")


input_file='s3a://spark-test-data/Fire_Department_Calls_for_Service.csv'

file_schema = StructType([StructField("Call_Number",StringType(),True),
        StructField("Unit_ID",StringType(),True),
        StructField("Incident_Number",StringType(),True),
...
...
# Read file into a Spark DataFrame
input_df = (spark.read.format("csv")
            .option("header", "true")
            .schema(file_schema)
            .load(input_file))

The code fails when it starts to execute spark.read.format. It appears that it can't find the class: java.lang.NoClassDefFoundError: com.amazonaws.services.s3.model.MultiObjectDeleteException

  File "<stdin>", line 1, in <module>
  File "/usr/local/spark/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/readwriter.py", line 178, in load
    return self._df(self._jreader.load(path))
  File "/usr/local/spark/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/usr/local/spark/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/utils.py", line 128, in deco
    return f(*a, **kw)
  File "/usr/local/spark/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o51.load.
: java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2532)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2497)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3269)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.s3.model.MultiObjectDeleteException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

I've been trying to find the right combination of the above jars and Python, but I couldn't find the right mix. I'm getting all kinds of NoClassDefFoundError errors, so I decided to use the latest versions of all the jars and Python listed above, but I'm still unsuccessful.

I would like to know what versions of jars and Python you used to successfully set up a cluster that can access S3 using s3a via PySpark. Thank you in advance for any response/help.

Hadoop 3.2 was built against 1.11.563; stick the full shaded SDK of that specific version, "aws-java-sdk-bundle", on your classpath and all should be well.
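As a minimal sketch of what that looks like from PySpark (the artifact coordinates and versions below are assumptions based on the versions named in this thread; they must match the Hadoop build your Spark distribution actually ships with), the matching shaded bundle can be resolved at session start via spark.jars.packages:

from pyspark.sql import SparkSession

# Sketch: let Spark resolve hadoop-aws plus the matching shaded AWS SDK bundle.
# Versions are illustrative; hadoop-aws must match the Hadoop that Spark ships
# with, and aws-java-sdk-bundle must match what that hadoop-aws was built against.
spark = (SparkSession.builder
         .appName("S3AWithBundle")
         .config("spark.jars.packages",
                 "org.apache.hadoop:hadoop-aws:3.2.0,"
                 "com.amazonaws:aws-java-sdk-bundle:1.11.563")
         .getOrCreate())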

The SDK has been "fussy" in the past... and an upgrade invariably causes surprises. For the curious, see Qualifying an AWS SDK update. It's probably about time someone did it again.

I was able to solve this issue on Spark 3.0 / Hadoop 3.2. I documented my answer here as well - AWS EKS Spark 3.0, Hadoop 3.2 Error - NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException

Use the following AWS Java SDK bundle and this issue will be solved:

aws-java-sdk-bundle-1.11.874.jar ( https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.874 )
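For a concrete sketch of wiring that in (the local paths below are hypothetical; point them at wherever you placed the downloaded jars), the bundle can be handed to the session together with the matching hadoop-aws jar via spark.jars, or simply copied into $SPARK_HOME/jars:

from pyspark.sql import SparkSession

# Sketch: reference locally downloaded jars (paths are hypothetical).
# The shaded aws-java-sdk-bundle replaces the unshaded aws-java-sdk jar entirely.
jars = ",".join([
    "/opt/spark/extra-jars/hadoop-aws-3.2.0.jar",
    "/opt/spark/extra-jars/aws-java-sdk-bundle-1.11.874.jar",
])

spark = (SparkSession.builder
         .appName("S3AWithBundleJars")
         .config("spark.jars", jars)
         .getOrCreate())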

So I cleaned up everything and re-installed the following versions of jars, and it worked: hadoop-aws-2.7.4.jar, aws-java-sdk-1.7.4.2.jar. Spark install version: spark-2.4.7-bin-hadoop2.7. Python version: Python 3.6.

In addition to the answers, I will add my two cents on top of them. Although adding the aws-java-sdk-bundle works perfectly, I found it better to add the specific dependencies to make the package smaller.

I replaced aws-java-sdk-bundle with aws-java-sdk-sts, aws-java-sdk-s3, and aws-java-sdk-dynamodb. The final size went from ~200MB to ~125MB. A sketch of this slimmer setup follows below.
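This is a hedged sketch of that approach, not a verified recipe: the module versions below are assumptions and should be kept in sync with each other and compatible with your hadoop-aws jar, and spark.jars.excludes is used so the transitive aws-java-sdk-bundle is not pulled back in:

from pyspark.sql import SparkSession

# Sketch: pull in only the SDK modules S3A needs instead of the full shaded
# bundle. Versions are illustrative; exclude the bundle so dependency
# resolution does not re-introduce it via hadoop-aws.
sdk_version = "1.11.874"
packages = ",".join([
    "org.apache.hadoop:hadoop-aws:3.2.0",
    "com.amazonaws:aws-java-sdk-sts:" + sdk_version,
    "com.amazonaws:aws-java-sdk-s3:" + sdk_version,
    "com.amazonaws:aws-java-sdk-dynamodb:" + sdk_version,
])

spark = (SparkSession.builder
         .appName("S3ASlimDeps")
         .config("spark.jars.packages", packages)
         .config("spark.jars.excludes", "com.amazonaws:aws-java-sdk-bundle")
         .getOrCreate())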
