
Error while writing a dataframe into parquet format

I am trying to convert a dataframe to Parquet format in an S3 bucket on AWS, but I am getting an error that the 's3a' bucket I am using is not found. I am using the code below for the conversion.

df.write.mode('overwrite').parquet(folder_path)

The error I am getting is:

An error occurred while calling o328.parquet.
: java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
    at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:376)
    at org.apache.hadoop.fs.s3a.DefaultS3ClientFactory.createS3Client(DefaultS3ClientFactory.java:51)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:229)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
    at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:556)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2571)
    at org.apache.hadoop.conf.Configuration.getClasses(Configuration.java:2642)
    at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:373)
    ... 26 more

I am using Spark 3.2.1 and Hadoop 3.2.

I have already downloaded the jar files, copied them into Spark's jars folder, and restarted my kernel.

Please suggest a solution.

  1. IAMInstanceCredentialsProvider is in hadoop-aws 3.3.
  2. In that version of Hadoop, the core-default.xml file in hadoop-common lists it as one of the default credential providers for fs.s3a.aws.credentials.provider.
  3. Since S3AUtils is found, there is some version of hadoop-aws on your classpath.
  4. But since it cannot find that class, it is a version < 3.3.0 (see the sketch after this list for a quick way to check which versions are actually loaded).
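
A quick way to confirm points 3 and 4 is to list the hadoop-* jars Spark actually loads. A minimal sketch in Python, assuming SPARK_HOME points at the Spark installation whose jars folder the downloaded files were copied into:

import glob
import os

# List every hadoop-* jar on Spark's classpath; a mix of versions
# (e.g. hadoop-common 3.3.x next to hadoop-aws 3.2.x) produces exactly
# this kind of ClassNotFoundException.
spark_home = os.environ.get("SPARK_HOME", ".")
for jar in sorted(glob.glob(os.path.join(spark_home, "jars", "hadoop-*.jar"))):
    print(os.path.basename(jar))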

You have an inconsistent set of hadoop-* jars, either locally or on the cluster. As the Hadoop S3A troubleshooting docs note, mixing hadoop-* jars is a way to see interesting stack traces, just as mixing spark-* jars would be.

Fix: use a consistent set of hadoop-* jars across the entire cluster.
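
For example, if the hadoop-common jars on the classpath are 3.3.x, pulling a matching hadoop-aws through spark.jars.packages keeps the set consistent. A minimal sketch, assuming Hadoop 3.3.1 and a hypothetical bucket path; substitute the version that matches your own hadoop-common jars:

from pyspark.sql import SparkSession

# hadoop-aws must match the hadoop-common version already on the classpath.
# 3.3.1 is an assumption here; replace it with your actual Hadoop version.
spark = (
    SparkSession.builder
    .appName("s3a-parquet-write")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("s3a://your-bucket/your-prefix/")  # hypothetical path

Note that spark.jars.packages only takes effect when the JVM starts, so it has to be set before the first SparkSession is created (restart the kernel first). Alternatively, keep a matching hadoop-aws jar (and its aws-java-sdk-bundle dependency) directly in Spark's jars folder and remove any older copies.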
