
Can't write data files using Pyspark?

I'm having trouble writing my dataframes to other file formats. From the online tutorials, it looks like this should work:

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)

df = spark.read.csv("table.csv")

df.write.orc("tests/file.orc")

but this write.orc call results in this long error:

20/06/01 13:54:32 ERROR Executor: Exception in task 0.0 in stage 63.0 (TID 63)

java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\t-aldouc\PycharmProjects\SE-DAT-EAIB-RecSysSpark\tests\test_data\test_orc.orc\_temporary\0\_temporary\attempt_20200601135431_0063_m_000000_63\part-00000-1b6b5b32-a036-414b-bb33-07e39b81c5d3-c000.snappy.orc

              at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770)

              at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)

              at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)

              at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)

              at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)

              at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)

              at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)

              at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)

              at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)

              at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)

              at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)

              at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)

              at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)

              at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)

              at org.apache.orc.impl.PhysicalFsWriter.<init>(PhysicalFsWriter.java:95)

              at org.apache.orc.impl.WriterImpl.<init>(WriterImpl.java:177)

              at org.apache.orc.OrcFile.createWriter(OrcFile.java:860)

              at org.apache.orc.mapreduce.OrcOutputFormat.getRecordWriter(OrcOutputFormat.java:50)

              at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:43)

              at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:121)

              at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)

              at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)

              at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)

              at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)

              at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)

              at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)

              at org.apache.spark.scheduler.Task.run(Task.scala:123)

              at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)

              at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)

              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)

              at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

              at java.lang.Thread.run(Unknown Source)

20/06/01 13:54:32 WARN TaskSetManager: Lost task 0.0 in stage 63.0 (TID 63, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\t-aldouc\PycharmProjects\SE-DAT-EAIB-RecSysSpark\tests\test_data\test_orc.orc\_temporary\0\_temporary\attempt_20200601135431_0063_m_000000_63\part-00000-1b6b5b32-a036-414b-bb33-07e39b81c5d3-c000.snappy.orc

              ... (same stack trace as above)

20/06/01 13:54:32 ERROR TaskSetManager: Task 0 in stage 63.0 failed 1 times; aborting job

20/06/01 13:54:32 ERROR FileFormatWriter: Aborting job 55384644-29f4-4a2c-8a40-def2a2e2da73.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 63.0 failed 1 times, most recent failure: Lost task 0.0 in stage 63.0 (TID 63, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\t-aldouc\PycharmProjects\SE-DAT-EAIB-RecSysSpark\tests\test_data\test_orc.orc\_temporary\0\_temporary\attempt_20200601135431_0063_m_000000_63\part-00000-1b6b5b32-a036-414b-bb33-07e39b81c5d3-c000.snappy.orc

              ... (same stack trace as above)

Driver stacktrace:

              at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)

              at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)

              at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)

              at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

              at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

              at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)

              at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)

              at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)

              at scala.Option.foreach(Option.scala:257)

              at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)

              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)

              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)

              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)

              at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)

              at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)

              at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)

              at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)

              at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)

              at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)

              at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)

              at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)

              at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)

              at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)

              at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)

              at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

              at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)

              at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)

              at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)

              at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)

              at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)

              at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)

              at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)

              at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)

              at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)

              at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)

              at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)

              at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)

              at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)

              at org.apache.spark.sql.DataFrameWriter.orc(DataFrameWriter.scala:588)

              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

              at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

              at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

              at java.lang.reflect.Method.invoke(Unknown Source)

              at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)

              at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)

              at py4j.Gateway.invoke(Gateway.java:282)

              at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)

              at py4j.commands.CallCommand.execute(CallCommand.java:79)

              at py4j.GatewayConnection.run(GatewayConnection.java:238)

              at java.lang.Thread.run(Unknown Source)

Caused by: java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\t-aldouc\PycharmProjects\SE-DAT-EAIB-RecSysSpark\tests\test_data\test_orc.orc\_temporary\0\_temporary\attempt_20200601135431_0063_m_000000_63\part-00000-1b6b5b32-a036-414b-bb33-07e39b81c5d3-c000.snappy.orc

              ... (same stack trace as above)

              ... 1 more

Traceback (most recent call last):

  File "<input>", line 1, in <module>

  File "C:\Users\t-aldouc\AppData\Local\Programs\Python\Python37\lib\site-packages\pyspark\sql\readwriter.py", line 960, in orc

    self._jwrite.orc(path)

  File "C:\Users\t-aldouc\AppData\Local\Programs\Python\Python37\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__

    answer, self.gateway_client, self.target_id, self.name)

  File "C:\Users\t-aldouc\AppData\Local\Programs\Python\Python37\lib\site-packages\pyspark\sql\utils.py", line 63, in deco

   return f(*a, **kw)

  File "C:\Users\t-aldouc\AppData\Local\Programs\Python\Python37\lib\site-packages\py4j\protocol.py", line 328, in get_return_value

    format(target_id, ".", name), value)

py4j.protocol.Py4JJavaError: An error occurred while calling o1488.orc.

: org.apache.spark.SparkException: Job aborted.

              at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)

              ... (same driver-side frames as in the Driver stacktrace above)

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 63.0 failed 1 times, most recent failure: Lost task 0.0 in stage 63.0 (TID 63, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\t-aldouc\PycharmProjects\SE-DAT-EAIB-RecSysSpark\tests\test_data\test_orc.orc\_temporary\0\_temporary\attempt_20200601135431_0063_m_000000_63\part-00000-1b6b5b32-a036-414b-bb33-07e39b81c5d3-c000.snappy.orc

              ... (same stack trace as above)

Driver stacktrace:

              at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)

.....can't enter this many characters

I received a similar error while trying to write a parquet file and a CSV. I managed to write the parquet file by converting the DataFrame to a pandas DataFrame and using to_parquet(), but I can't find a similar workaround for .orc files. How do I fix this? I already tried adding the HADOOP PATH variable, but it did nothing.

I had to run PyCharm as administrator; then it worked fine.
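If elevating PyCharm isn't an option, the usual alternative is to point Spark at a Hadoop winutils install before the JVM starts. A sketch only, assuming winutils.exe was downloaded to C:\hadoop\bin (that path is an assumption; adjust it to your install):

```python
import os

# Assumed install location of winutils.exe -- adjust to your machine.
hadoop_home = r"C:\hadoop"

# Both variables must be set before the first SparkSession is created,
# because the JVM reads them once at startup.
os.environ["HADOOP_HOME"] = hadoop_home
os.environ["PATH"] = os.path.join(hadoop_home, "bin") + os.pathsep + os.environ.get("PATH", "")

# With the environment in place, the original write should go through:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.master("local[*]").getOrCreate()
# spark.read.csv("table.csv").write.orc("tests/file.orc")
```

Setting the variable system-wide (System Properties → Environment Variables) and restarting the IDE achieves the same thing without code changes.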
