
Spark throws Error: FileNotFoundException when writing data frame to S3

  1. We have a data frame that we want to write to S3 in Parquet format and in overwrite mode.
  2. Every time we write the data frame, it always goes to a new folder. The code that writes to the S3 location is as follows:
        df.write
          .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
          .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
          .option("maxRecordsPerFile", maxRecordsPerFile)
          .mode("overwrite")
          .format(format)
          .save(output)

What we observe is that, at times, we get a FileNotFoundException (full trace below). Can somebody help me understand:

  1. When I am writing to a new S3 location (meaning nobody is reading from that location), why does the writing program throw the exception below? (A hypothetical pre-write check of that assumption is sketched after this list.)
  2. How do I fix it? I see a couple of Stack Overflow posts pointing to this exception, but they say it happens when you try to read while a write is in progress. My case is not like that; I don't read while the write happens.
  3. My Spark version is 2.3.2, on EMR-5.18.1; the code is written in Scala.
  4. I am using s3:// as the output folder path. Should I change it to s3n:// or s3a://? Will that help?
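
As a side note, the assumption in question 1 (that nothing else has written to, or is reading from, the output location) can be sanity-checked before the job writes. The sketch below is not from the original post; the path and application name are placeholders.

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    // Hypothetical pre-write check: list the output prefix before writing to
    // confirm it really is empty and no other process has produced data there.
    val spark = SparkSession.builder().appName("output-check").getOrCreate()
    val outputPath = new Path("s3://BUCKET/snapshots/FOLDER")  // placeholder path

    val fs = outputPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
    if (fs.exists(outputPath)) {
      val entries = fs.listStatus(outputPath)
      println(s"Output path already contains ${entries.length} entries before the write")
    }

The full trace referenced above is: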
Caused by: java.io.FileNotFoundException: No such file or directory 's3://BUCKET/snapshots/FOLDER/_bid_9223370368440344985/part-00020-693dfbcb-74e9-45b0-b892-0b19fa92365c-c000.snappy.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:131)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:104)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:101)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:853)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:853)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

I finally was able to solve the problem:

  1. The df: DataFrame was built from the same S3 folder that the job then writes back to in overwrite mode.

  2. So during the overwrite, the source folder gets cleared, which was causing the FileNotFoundException (see the sketch below).
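
A minimal sketch of one way around this, assuming hypothetical paths (the staging location below is illustrative, not from the original job): materialize the DataFrame to a separate location first, then overwrite the original folder from that copy, so the overwrite never deletes files the lazy plan still needs.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("safe-overwrite").getOrCreate()

    // Both paths are placeholders for illustration.
    val sourcePath  = "s3://BUCKET/snapshots/FOLDER"       // folder the DataFrame is read from
    val stagingPath = "s3://BUCKET/snapshots/FOLDER_tmp"   // hypothetical staging location

    // 1. Build the DataFrame from the current contents of the folder.
    val df = spark.read.parquet(sourcePath)

    // 2. Materialize it to the staging location first, so the plan no longer
    //    depends on files that the overwrite below is about to delete.
    df.write.mode("overwrite").parquet(stagingPath)

    // 3. Re-read the staged copy and overwrite the original folder from it.
    spark.read.parquet(stagingPath)
      .write
      .mode("overwrite")
      .parquet(sourcePath)

Alternatively, checkpointing the DataFrame (with spark.sparkContext.setCheckpointDir configured) should achieve a similar decoupling by persisting the data before the overwrite starts.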

Hope this helps somebody.
