I have a workflow on Spark 3.1 and writing a dataframe in the end partitioned by year,month,day,hour to S3. I expect the files in each "folder" in S3 to be overwritten but they're always appended. Any idea as to what might be the problem?
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df
.write
.mode(SaveMode.Overwrite)
.partitionBy("year", "month", "day", "hour")
.json(outputPath)
I suggest this version:
df
.write
.mode('overwrite')
.partitionBy("year", "month", "day", "hour")
.json(outputPath)
or this one:
df
.write
.mode(SaveMode.Overwrite)
.partitionBy("year", "month", "day", "hour")
.json(outputPath)
For older versions of Spark, you can use the following to overwrite the output directory with the RDD contents:
sparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sparkContext = SparkContext(sparkConf)
It seems this is a bug on Spark 3.1. Downgrading to Spark 3.0.1 helps.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.