
How to overwrite Spark Dataset

I have an existing Spark Dataset in my application. I used

dataframe = dataframe.withColumn(colName, newColumn); // newColumn is the Column expression to apply

to update it. Now my last step is to write it to the Parquet file.

dataframe.write().mode(SaveMode.Append).parquet(getDSPath(dataset).toString());

When I use Append mode, it adds to the existing Dataset and creates duplicated rows. If I use SaveMode.Overwrite, an exception is thrown:

File file:/share/data/applocation/spark/DATASETUAT/part-00000-3124c90f-461f-4c13-a5b2-25064de0ce59-c000.snappy.parquet does not exist

What can I do to Overwrite an existing Dataset?

I resolved it. The trick is to write the data to a temporary Parquet file in a different location first, then read it back into a new Dataset, and only then overwrite the original Parquet file. (Overwrite fails on the original path because Spark reads the source Parquet files lazily, so the write deletes them before the updated Dataset has been fully computed; reloading from the temporary copy removes that dependency.)

At the end it is necessary to remove the temporary Parquet file and clear the intermediate Dataset.
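A minimal Java sketch of that temp-write-and-reload approach, assuming the path from the question; the _tmp suffix and the Hadoop FileSystem cleanup are illustrative choices, not the original code:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class OverwriteParquet {
    public static void main(String[] args) throws IOException {
        SparkSession spark = SparkSession.builder().appName("overwrite-parquet").getOrCreate();

        String targetPath = "/share/data/applocation/spark/DATASETUAT"; // original dataset location
        String tempPath = targetPath + "_tmp";                          // hypothetical temporary location

        // Load the existing data and apply the update (e.g. withColumn).
        Dataset<Row> dataframe = spark.read().parquet(targetPath);
        // dataframe = dataframe.withColumn(colName, newColumn);

        // 1. Write the updated data to the temporary location first.
        dataframe.write().mode(SaveMode.Overwrite).parquet(tempPath);

        // 2. Re-read from the temporary files so the new Dataset no longer
        //    depends on the original Parquet files.
        Dataset<Row> reloaded = spark.read().parquet(tempPath);

        // 3. Now it is safe to overwrite the original location.
        reloaded.write().mode(SaveMode.Overwrite).parquet(targetPath);

        // 4. Remove the temporary Parquet files.
        FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
        fs.delete(new Path(tempPath), true);

        spark.stop();
    }
}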
