# PYSPARK Error connecting to aws S3: py4j.protocol.Py4JJavaError: java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException
I keep getting the same error whenever I try to read a file from my S3 bucket. I am working on EC2 with jupyter-lab, but I get the same result without jupyter-lab and on my personal laptop. Here is my code:
```python
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# My spark configuration
conf = SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.0')
#conf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
conf.set('spark.hadoop.fs.s3a.access.key', key)
conf.set('spark.hadoop.fs.s3a.secret.key', secret)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Path to my test file (which I can read locally with the same code)
path = "s3a://bucket-name/folder/test.csv"
csv = spark.read.format("csv").load(path)
```
The error is the following:
```
Py4JJavaError: An error occurred while calling o37.load.
: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
    at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:893)
    at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:869)
    at org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1580)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:341)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)
```
Sometimes "calling o37.load" shows a different object number, e.g. o43.
I also get some WARNs while building the SparkSession:
```
:: loading settings :: url = jar:file:/home/ubuntu/.local/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-840765ed-4adc-4453-b354-a3a8093d3776;1.0
    confs: [default]
    found org.apache.hadoop#hadoop-aws;3.3.0 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.563 in central
    found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
downloading https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.563/aws-java-sdk-bundle-1.11.563.jar ...
    [SUCCESSFUL ] com.amazonaws#aws-java-sdk-bundle;1.11.563!aws-java-sdk-bundle.jar (4888ms)
downloading https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar ...
    [SUCCESSFUL ] org.wildfly.openssl#wildfly-openssl;1.0.7.Final!wildfly-openssl.jar (22ms)
:: resolution report :: resolve 697ms :: artifacts dl 4998ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.563 from central in [default]
    org.apache.hadoop#hadoop-aws;3.3.0 from central in [default]
    org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   3   |   0   |   0   |   0   ||   3   |   2   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-840765ed-4adc-4453-b354-a3a8093d3776
    confs: [default]
    3 artifacts copied, 0 already retrieved (128050kB/319ms)
22/10/08 13:55:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
```
I followed a Medium tutorial to make sure my hadoop / aws-java-sdk versions were compatible.
I also tried reading with the "binaryFile" and "image" formats. Same result.
I can read the file from Python with boto3, but I need to use pySpark.
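(As a side note, boto3 addresses objects by bucket and key rather than by URI, so the `s3a://` path Spark uses has to be split first. A small helper, hypothetical and just for illustration, that does the conversion:)

```python
from urllib.parse import urlparse


def s3a_to_bucket_key(path):
    """Split an s3a:// URI (as used by Spark) into the (bucket, key)
    pair that boto3's get_object expects."""
    parsed = urlparse(path)
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = s3a_to_bucket_key("s3a://bucket-name/folder/test.csv")
# bucket == "bucket-name", key == "folder/test.csv"
```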
The problem was the hadoop version in my SparkSession configuration. I ran the following command to find my hadoop version:
```python
print(f"Hadoop version = {spark._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}")
# Hadoop version = 3.3.2
```
where `spark` is the result of `spark = SparkSession.builder.config(conf=conf).getOrCreate()`.
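A JVM-free way to check the same thing is to look at the jar names that ship inside the PySpark install (`pyspark/jars` typically contains a `hadoop-client-api-<version>.jar`). This is a sketch of my own, not part of the original answer, and the helper name is made up:

```python
import re


def hadoop_version_from_jars(jar_names):
    """Infer the bundled Hadoop version from a list of jar file names
    (e.g. os.listdir of the pyspark/jars directory).
    Returns None if no hadoop-client-api jar is found."""
    for name in jar_names:
        m = re.match(r"hadoop-client-api-(\d+(?:\.\d+)*)\.jar$", name)
        if m:
            return m.group(1)
    return None


# Usage against a local install (assumes pyspark is installed):
# import os, pyspark
# jars = os.listdir(os.path.join(os.path.dirname(pyspark.__file__), "jars"))
# print(hadoop_version_from_jars(jars))
```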
So I had to change the first line of my configuration as follows:
```python
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.2')
```
But! The error still persists when I use `read.format("images")`, so I am still looking into that, and into whether the binaryFile format can be used at all.