
I can read from a local file in PySpark, but I can't write a DataFrame to a local file

df.write.csv("sdf") 

" 21/07/24 15:27:23 ERROR FileFormatWriter: Aborting job a9914f88-3ab9-480a-984f-33d0e598c0fc. java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/ String;I)Z at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method) at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:645) at org.apache.hadoop.fs .FileUtil.canRead(FileUtil.java:1230) at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1435) at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:493) at org.apache.hadoop. fs.FileSystem.listStatus(FileSystem.java:1868) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(Checksum FileSystem.java:678) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) at org.apache.hadoop.mapreduce.lib.output .FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:332) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:402) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java :375) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:182) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:220) at org .apache.spark.sql.execution.datasources.Insert IntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108) at org.apache.spark.sql.execution.command.DataWritingCommandExec. sideEffectResult(commands.scala:106) at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1( SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151 ) 在 org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) 在 org.apache.spark.sql.execution.SparkSlan.4execute8(SparkPlan.execute.scala:215) 552156588:176) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131) at org .apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org. apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache. 
spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.883688262367 88.DataFrameWriter.runCommand(DataFrameWriter.scala:989) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293) at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:979) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)在 sun.reflect.NativeMethodAccessorImpl.invoke(未知來源) 在 sun.reflect.DelegatingMethodAccessorImpl.invoke(未知來源) 在 java.lang.reflect.Method.invoke(未知來源) 在 py4j.reflection.MethodInvoker.invoke(MethodInvoker.88213284869458 :244) 在 py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) 在 py4j.Gateway.invoke(Gateway.java:282) 在 py4j.commands.AbstractCommand.invokeMethod(AbstractComm and.java:132) 在 py4j.commands.CallCommand.execute(CallCommand.java:79) 在 py4j.GatewayConnection.run(GatewayConnection.java:238) 在 java.Traceback(Unknown Sourceback.Thread)最后調用):文件“”,第 1 行,在文件“C:\spark\python\pyspark\sql\readwriter.py”中,第 1372 行,在 csv self._jwrite.csv(path) 文件“C:\spark\ python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py",第 1305 行,調用文件“C:\spark\python\pyspark\sql\utils.py”,第 111 行,deco 返回f(*a, **kw) 文件“C:\spark\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py”,第 328 行,在 get_return_value py4j.protocol.Py4JJavaError:一個錯誤調用 o40.csv 時發生。 :org.apache.spark.SparkException:作業中止。 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188) at org. apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106) at org.apache. spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark. 
sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) 在 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala: 151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at org.apache.spark .sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131) at org.apache.spark.sql.DataFrameWriter.$ anonfun$runCommand$1(DataFrameWriter.scala:989) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$ .withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withA ctive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415) at org.apache.spark.sql.DataFrameWriter.save( DataFrameWriter.scala:293) at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:979) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(未知來源)位於 java.lang.reflect.Method.invoke(未知來源)位於 py4j.reflection.MethodInvoker.invoke(MethodInvoker.882132469458 88:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j. commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Unknown Source) Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop. io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method) at org.apache.hadoop.io.nativeio.NativeIO$Windows. access(NativeIO.java:645) at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1230) at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1435) at org.apache.hadoop.fs.RawLocalFile System.listStatus(RawLocalFileSystem.java:493) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) at org.apache.hadoop.fs .ChecksumFileSystem.listStatus(ChecksumFileSystem.java:678) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) at org.apache.hadoop. mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:332) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:402) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter. commitJob(FileOutputCommitter.java:375)在 org.88352839 602088.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:182) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:220)

Besides getting winutils.exe and setting HADOOP_HOME, please check whether the hadoop.dll binary is present in your bin folder. If it is not, download it from the GitHub repo:

https://github.com/cdarlint/winutils/blob/master/hadoop-3.2.1/bin/hadoop.dll
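
For reference, here is a minimal sketch of wiring this up from Python before the SparkSession (and therefore the JVM) is created. The locations C:\hadoop and C:\hadoop\bin are assumptions; point them at wherever you placed winutils.exe and hadoop.dll.

import os

# Assumed location of the winutils.exe / hadoop.dll bundle; adjust to your setup.
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = r"C:\hadoop\bin;" + os.environ["PATH"]

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("local-write-test").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").csv("sdf")  # expected to succeed once hadoop.dll can be found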

It worked for me.

Adding hadoop.dll to the HADOOP_HOME path solves the problem. Thanks!

For PySpark 3.3.1, Windows 10, Java 18:

  • Download the bin folder containing winutils.exe and hadoop.dll (from here https://github.com/cdarlint/winutils or here https://github.com/steveloughran/winutils ; my version is 3.0.0, but any version newer than 3.0.0 should also work), and put the folder in C:\hadoop
  • Go to System variables, create HADOOP_HOME, and set it to C:\hadoop
  • Then add it to Path under System variables as %HADOOP_HOME%\bin (a quick sanity check is sketched below)
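
To double-check the steps above, a small, purely illustrative Python snippet (assuming HADOOP_HOME was set to C:\hadoop as described) can confirm that the files are where the JVM will look for them:

import os

hadoop_home = os.environ.get("HADOOP_HOME", "")
print("HADOOP_HOME =", hadoop_home or "<not set>")

# winutils.exe and hadoop.dll are both expected under %HADOOP_HOME%\bin.
for name in ("winutils.exe", "hadoop.dll"):
    path = os.path.join(hadoop_home, "bin", name)
    print(name, "OK" if os.path.isfile(path) else "MISSING", "->", path)

# The bin folder must also be reachable through PATH so Windows can load hadoop.dll.
on_path = any(os.path.isfile(os.path.join(p, "hadoop.dll"))
              for p in os.environ.get("PATH", "").split(os.pathsep) if p)
print("hadoop.dll reachable via PATH:", on_path)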

If any problem remains while writing files (JSON, Parquet, etc.), try putting the hadoop.dll file into the System32 folder as well. That's it.
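
If it is still unclear whether Windows can actually load the DLL (from %HADOOP_HOME%\bin, PATH, or System32), a small ctypes probe can tell you. This is only a diagnostic sketch, not part of Spark or Hadoop:

import ctypes

try:
    # WinDLL searches the standard Windows DLL locations (PATH, System32, ...).
    ctypes.WinDLL("hadoop.dll")
    print("hadoop.dll loaded successfully")
except OSError as exc:
    print("hadoop.dll could not be loaded:", exc)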

