Spark: overwrite partitioned folders

I have a workflow on Spark 3.1 that, at the end, writes a dataframe to S3 partitioned by year, month, day, and hour. I expect the files in each "folder" in S3 to be overwritten, but they are always appended instead. Any idea what might be the problem?

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("year", "month", "day", "hour")
  .json(outputPath)
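One way to narrow this down is to read the setting back from the same session immediately before the write, to confirm the dynamic mode is actually in effect there; a minimal sketch, reusing the spark session from above:

// Sketch: confirm the overwrite mode this session will actually use.
println(spark.conf.get("spark.sql.sources.partitionOverwriteMode"))   // expected: "dynamic"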

I suggest this version:

df
  .write
  .mode("overwrite")
  .partitionBy("year", "month", "day", "hour")
  .json(outputPath)

or this one:

df
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("year", "month", "day", "hour")
  .json(outputPath)
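The dynamic mode can also be passed as an option on the writer itself, which takes precedence over the session-level configuration; a minimal sketch reusing the same DataFrame and output path:

df
  .write
  .mode(SaveMode.Overwrite)
  .option("partitionOverwriteMode", "dynamic")   // takes precedence over the session setting
  .partitionBy("year", "month", "day", "hour")
  .json(outputPath)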

For older versions of Spark, you can use the following to overwrite the output directory with the RDD contents:

import org.apache.spark.{SparkConf, SparkContext}
val sparkConf = new SparkConf().set("spark.hadoop.validateOutputSpecs", "false")
val sparkContext = new SparkContext(sparkConf)
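With that check disabled, saving an RDD to a path that already exists is no longer rejected up front; a rough sketch, where the sample data and the output path are placeholders:

// Sketch only: sample records and a placeholder output path.
val records = sparkContext.parallelize(Seq("a", "b", "c"))
records.saveAsTextFile("s3://my-bucket/output")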

It seems this is a bug in Spark 3.1. Downgrading to Spark 3.0.1 helps.
