
Error while writing Spark 3.0 SQL DataFrame to CSV file using Scala without Databricks

I am using Apache Spark 3.0 for development. I read data from a txt file, generate an RDD, and then convert it into a DataFrame. Because the data is huge, I take 100 rows from that DataFrame and build a new DataFrame with the schema below. When I try to write it out as a CSV file, I get the error shown further down. I have also filled the null values in the DataFrame, but I still have no solution. I have tried every suggestion I could find online and am still not sure what to do. I do not want to use Databricks here. Please help me with this.
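For context, here is a minimal sketch of the pipeline described above. The input path, delimiter, column count, and variable names are my assumptions, not the original code (the real job has 11 columns, including doubles):

```
import org.apache.spark.sql.SparkSession

// Hypothetical reconstruction of the steps described above.
val spark = SparkSession.builder()
  .appName("CsvWriteExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Read the raw text file as an RDD of lines and split on an assumed delimiter.
val rdd  = spark.sparkContext.textFile("C:/Users/aksparmar/Documents/input.txt")
val rows = rdd.map(_.split("\\|"))

// Build a DataFrame (only three columns shown for brevity),
// keep 100 rows, and replace null values before writing.
val df  = rows.map(a => (a(0), a(1), a(2))).toDF("A1", "A2", "A3")
val cdf = df.limit(100).na.fill("")
```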

My schema is below:

```
root
 |-- A1: string (nullable = true)
 |-- A2: string (nullable = true)
 |-- A3: string (nullable = true)
 |-- A4: string (nullable = true)
 |-- A5: double (nullable = true)
 |-- A6: double (nullable = true)
 |-- A7: double (nullable = true)
 |-- A8: double (nullable = true)
 |-- A9: double (nullable = true)
 |-- A10: string (nullable = true)
 |-- A11: string (nullable = true)

```

And the code I am using to write the CSV is below:

```
cdf.coalesce(1)
  .write
  .format("csv")
  .option("header", true)
  .mode("overwrite")
  .save("C:/Users/aksparmar/Documents/test30final.csv")

```
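Note that `save(...)` writes a directory named test30final.csv containing a part-*.csv file rather than a single flat file. If a single file is needed, one option (a sketch, not from the original post; it assumes the existing SparkSession and exactly one part file) is to rename the part file afterwards with the Hadoop FileSystem API:

```
import org.apache.hadoop.fs.{FileSystem, Path}

// Rename the single part file produced by coalesce(1) to a flat CSV file.
// The paths here are assumptions based on the save() call above.
val fs     = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val outDir = new Path("C:/Users/aksparmar/Documents/test30final.csv")
val part   = fs.globStatus(new Path(outDir, "part-*.csv"))(0).getPath
fs.rename(part, new Path("C:/Users/aksparmar/Documents/test30final_single.csv"))
```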

```
  org.apache.spark.SparkException: Job aborted.
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:226)
      at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:178)
      at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
      at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
      at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
      at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
      at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
      at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
      at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
      at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
      at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
      at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
      at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
      at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
      at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
      at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
      at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
      at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
      ... 36 elided
    Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 102, AKSPARMAR.com, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
    Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:645)
        at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1230)
        at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1435)
        at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:493)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
        at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:678)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:484)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:597)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:560)
        at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
        at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:77)
        at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:245)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:79)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:275)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
        ... 9 more
    
    Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
      at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
      at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
      at scala.Option.foreach(Option.scala:407)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
      at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:195)
      ... 57 more
    Caused by: org.apache.spark.SparkException: Task failed while writing rows.
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:291)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
      at org.apache.spark.scheduler.Task.run(Task.scala:127)
      at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      ... 1 more
    Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
      at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
      at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:645)
      at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1230)
      at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1435)
      at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:493)
      at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
      at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
      at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:678)
      at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:484)
      at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:597)
      at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:560)
      at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
      at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:77)
      at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:245)
      at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:79)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:275)
      at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
      ... 9 more
```

The hadoop.dll file for Hadoop 3.2.1 was missing from the bin folder. Both winutils.exe and hadoop.dll matching your Hadoop version are required when running Spark on Windows. After adding them, the issue was resolved.
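A minimal sketch of the usual fix on Windows (the install path C:\hadoop is an assumption): download winutils.exe and hadoop.dll built for your Hadoop version (3.2.1 here), put them in %HADOOP_HOME%\bin, and make sure Spark can see that directory before the SparkSession is created:

```
// Point Hadoop at the directory whose bin folder contains winutils.exe and hadoop.dll.
// "C:\\hadoop" is an assumed install location, not from the original post.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

// Alternatively, set the environment variables before launching the JVM:
//   setx HADOOP_HOME "C:\hadoop"
//   set PATH=%PATH%;%HADOOP_HOME%\bin

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("CsvWriteExample")
  .master("local[*]")
  .getOrCreate()
```

Some setups also need %HADOOP_HOME%\bin on PATH (or hadoop.dll copied to C:\Windows\System32) so that the JVM can load the native library.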
