Spark throws an error: FileNotFoundException when writing a DataFrame to S3

Question

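We have a DataFrame that we want to write to S3 in Parquet format, using overwrite mode.
Every time we write the DataFrame, it goes to a new folder. The code that writes to the S3 location is as follows: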

        df.write
          .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
          .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
          .option("maxRecordsPerFile", maxRecordsPerFile)
          .mode("overwrite")
          .format(format)
          .save(output)

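What we observe is that we sometimes get a FileNotFoundException (full trace below). Can someone help me understand the following?

When I am writing to a new S3 location (meaning nobody is reading from that location), why would the write throw the exception below?
How can I solve it? I have seen several Stack Overflow posts about this exception, but they say it happens when you try to read while a write is in progress. That is not my case: I am not reading while the write happens.
My Spark version is 2.3.2, on EMR-5.18.1, and the code is written in Scala.
I am using s3:// as the output folder path. Should I change it to s3n or s3a? Would that help?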

Caused by: java.io.FileNotFoundException: No such file or directory 's3://BUCKET/snapshots/FOLDER/_bid_9223370368440344985/part-00020-693dfbcb-74e9-45b0-b892-0b19fa92365c-c000.snappy.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:131)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:104)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:101)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:853)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:853)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Answer 1

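I was finally able to solve the problem.

The df: DataFrame was built from the same S3 folder that was being written to in overwrite mode.
So during the overwrite, the source folder was being cleared, and that caused the error.

Hope this helps someone.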
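
For anyone hitting the same thing, here is a minimal sketch of one way to guard against it. It is not the code from the question. Spark evaluates a DataFrame lazily, so the input files are only read while the write job runs, and by then overwrite mode may already have deleted them. The sketch checks whether the DataFrame reads from the output folder and, if so, writes to a separate staging folder first. The names SafeOverwrite, readsFromOutput, overwriteSafely, and the staging path are hypothetical. It also assumes plain Parquet output rather than the format and options used in the question. The check relies on df.inputFiles, which Spark documents as a best-effort snapshot, and on a simple string prefix comparison that only works when both paths use the same URI scheme.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, for illustration only.
object SafeOverwrite {
  // Append a trailing slash so that "s3://b/x" does not match "s3://b/xy/...".
  private def withSlash(p: String): String = if (p.endsWith("/")) p else p + "/"

  // True if any of the input files lives under the output folder.
  def readsFromOutput(inputFiles: Seq[String], output: String): Boolean =
    inputFiles.exists(_.startsWith(withSlash(output)))

  // If df reads from `output`, materialise it in `staging` first, then
  // overwrite `output` from the staged copy. Otherwise write directly.
  // Assumption: `staging` is a separate folder that nothing else uses.
  def overwriteSafely(spark: SparkSession, df: DataFrame,
                      output: String, staging: String): Unit = {
    if (readsFromOutput(df.inputFiles.toSeq, output)) {
      df.write.mode("overwrite").parquet(staging)
      spark.read.parquet(staging).write.mode("overwrite").parquet(output)
    } else {
      df.write.mode("overwrite").parquet(output)
    }
  }
}
```

The staging folder is left in place after the second write, so the caller would need to clean it up. If the job can simply write to a different folder than the one it reads from, that is the simpler fix.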
