
Spark avoid partition overwrite

I am writing a Spark application that saves log data into the directory /logroot.

My code is

myDF.write.mode('overwrite').partitionBy('date', 'site').save('/logroot')

I want to use overwrite mode because I re-process all the daily data several times a week.

My concern is that overwrite wipes the entire /logroot directory, not just the partitions involved.

How can I solve this problem?

At the moment of writing, the best solution seems to be the following (a sketch is given after the list):

  • Extract from the initial DataFrame the partition values that need to be cleaned
  • Delete those partition directories using the Hadoop FileSystem API
  • Save the DataFrame using append mode
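A minimal PySpark sketch of these three steps, assuming a SparkSession named spark, the myDF DataFrame and /logroot path from the question, and going through the internal spark._jvm gateway (a common but unofficial way to reach the Hadoop FileSystem API from PySpark):

# Collect the distinct (date, site) pairs present in the new data
partitions = myDF.select('date', 'site').distinct().collect()

# Reach the Hadoop FileSystem through the SparkSession's JVM gateway
hadoop_conf = spark._jsc.hadoopConfiguration()
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
Path = spark._jvm.org.apache.hadoop.fs.Path

# Delete only the partition directories that are about to be rewritten
for row in partitions:
    part_path = Path('/logroot/date=%s/site=%s' % (row['date'], row['site']))
    if fs.exists(part_path):
        fs.delete(part_path, True)  # True = recursive delete

# Append the new data; untouched partitions are left as they are
myDF.write.mode('append').partitionBy('date', 'site').save('/logroot')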

Thanks to all for the help; I hope the Spark developers will provide a more elegant option.

Roberto
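Note for later readers: Spark 2.3 added a dynamic partition overwrite mode that does roughly the above out of the box. A minimal sketch, assuming the same myDF and /logroot as in the question:

spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
# With dynamic mode, overwrite replaces only the partitions present in myDF
myDF.write.mode('overwrite').partitionBy('date', 'site').save('/logroot')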
