
How to overwrite Spark Dataset

I have an existing Spark Dataset in my application. I used

dataframe = dataframe.withColumn(colName, newColumn); // newColumn is the Column expression to apply

to update it. Now my last step is to write it to the Parquet file.

dataframe.write().mode(SaveMode.Append).parquet(getDSPath(dataset).toString());

When I use Append mode, it adds to the existing Dataset and creates duplicated rows. If I use SaveMode.Overwrite, an exception is thrown:

File file:/share/data/applocation/spark/DATASETUAT/part-00000-3124c90f-461f-4c13-a5b2-25064de0ce59-c000.snappy.parquet does not exist

What can I do to Overwrite an existing Dataset?

I resolved it. The trick is to write the data to a temporary Parquet file in a different location first, then read it back into a new Dataset, and only then overwrite the original Parquet file. (Overwrite fails on the original path because Spark reads the source Parquet files lazily, so the write deletes them before the updated Dataset has been fully computed; reloading from the temporary copy removes that dependency.)

At the end it is necessary to remove the temporary Parquet file and clear the intermediate Dataset.
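A minimal Java sketch of that temp-write-and-reload approach, assuming the path from the question; the _tmp suffix and the Hadoop FileSystem cleanup are illustrative choices, not the original code:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class OverwriteParquet {
    public static void main(String[] args) throws IOException {
        SparkSession spark = SparkSession.builder().appName("overwrite-parquet").getOrCreate();

        String targetPath = "/share/data/applocation/spark/DATASETUAT"; // original dataset location
        String tempPath = targetPath + "_tmp";                          // hypothetical temporary location

        // Load the existing data and apply the update (e.g. withColumn).
        Dataset<Row> dataframe = spark.read().parquet(targetPath);
        // dataframe = dataframe.withColumn(colName, newColumn);

        // 1. Write the updated data to the temporary location first.
        dataframe.write().mode(SaveMode.Overwrite).parquet(tempPath);

        // 2. Re-read from the temporary files so the new Dataset no longer
        //    depends on the original Parquet files.
        Dataset<Row> reloaded = spark.read().parquet(tempPath);

        // 3. Now it is safe to overwrite the original location.
        reloaded.write().mode(SaveMode.Overwrite).parquet(targetPath);

        // 4. Remove the temporary Parquet files.
        FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
        fs.delete(new Path(tempPath), true);

        spark.stop();
    }
}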
