
Spark: overwrite partitioned folders

I have a workflow on Spark 3.1 that, at the end, writes a dataframe to S3 partitioned by year, month, day, and hour. I expect the files in each "folder" in S3 to be overwritten, but they are always appended. Any idea what the problem might be?

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("year", "month", "day", "hour")
  .json(outputPath)
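
To rule out the setting simply not being picked up, the effective value can be read back from the same session right before the write (a minimal check, assuming the same spark session and write as above):

// Should print "dynamic" if the session-level setting took effect
println(spark.conf.get("spark.sql.sources.partitionOverwriteMode"))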

I suggest this version, which passes the save mode as a string:

df
  .write
  .mode("overwrite")
  .partitionBy("year", "month", "day", "hour")
  .json(outputPath)

or this one, which keeps the SaveMode enum:

df
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("year", "month", "day", "hour")
  .json(outputPath)

For older versions of Spark, you can disable the output specification check so that an RDD save can write into an existing output directory:

sparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sparkContext = SparkContext(sparkConf)
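
As a minimal sketch of how that is used, assuming the outputPath from above and a toy RDD, a subsequent save no longer fails when the directory already exists:

// saveAsTextFile normally throws if outputPath already exists;
// with validateOutputSpecs disabled, the check is skipped and the
// job writes its part files into the existing directory.
sparkContext.parallelize(Seq("a", "b", "c")).saveAsTextFile(outputPath)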

It seems this is a bug in Spark 3.1. Downgrading to Spark 3.0.1 helps.
