Error while writing a dataframe into parquet format
I am trying to write a dataframe in Parquet format to an S3 bucket on AWS, but I am getting an error saying that the 's3a' bucket I am using is not found. I am using the code below for the conversion:
df.write.mode('overwrite').parquet(folder_path)
The error I am getting is:
An error occurred while calling o328.parquet.
: java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:376)
at org.apache.hadoop.fs.s3a.DefaultS3ClientFactory.createS3Client(DefaultS3ClientFactory.java:51)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:229)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:556)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2571)
at org.apache.hadoop.conf.Configuration.getClasses(Configuration.java:2642)
at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:373)
... 26 more
I am using Spark 3.2.1 and Hadoop 3.2.
I have already downloaded the jar files, placed them in Spark's jars folder, and restarted my kernel.
Please provide any solution.
You have an inconsistent set of hadoop-* jars, either locally or on the cluster. As the Hadoop S3A troubleshooting docs note, mixing hadoop-* jars is a way to see interesting stack traces, just as mixing spark-* jars would be.
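To illustrate how such a mismatch can be spotted, here is a small sketch (not part of the original answer) that extracts the version embedded in each hadoop-* jar file name and flags a mixed set. The jar names below are hypothetical examples following the standard artifact-version.jar naming convention:

```python
import re

def hadoop_jar_versions(jar_names):
    """Map each hadoop-* artifact name to the version embedded in its file name."""
    versions = {}
    for name in jar_names:
        m = re.match(r"(hadoop-[a-z0-9-]+?)-(\d[\d.]*)\.jar$", name)
        if m:
            versions[m.group(1)] = m.group(2)
    return versions

# Hypothetical mixed set of the kind that produces a ClassNotFoundException:
jars = ["hadoop-common-3.2.1.jar", "hadoop-aws-3.3.1.jar",
        "aws-java-sdk-bundle-1.11.901.jar"]
found = hadoop_jar_versions(jars)
if len(set(found.values())) > 1:
    print("Inconsistent hadoop jars:", found)
```

With a consistent set, every hadoop-* artifact carries the same version string and the check stays silent.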
Fix: use a consistent set of Hadoop jars across the entire cluster.
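For a local PySpark setup, one common approach is to let Spark resolve the S3A connector from Maven rather than hand-copying jars into the jars folder. This is a sketch only: the version shown (3.2.1) is an assumption and must be replaced with the Hadoop version your Spark build actually ships.

```shell
# Sketch: pull hadoop-aws (and, transitively, a matching AWS SDK bundle)
# at the SAME version as the hadoop-* jars already on the classpath.
# 3.2.1 below is illustrative -- substitute your actual Hadoop version.
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.1
```

The same `--packages` flag works with spark-submit; the point is that hadoop-aws must never be newer or older than hadoop-common, or you get class-loading errors like the one above.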