How to overwrite a Spark Dataset

Question

I have an existing Spark Dataset in my application. I used

dataframe = dataframe.withColumn(colName, newColumn);

to update it. Now my last step is to write it to a Parquet file.

dataframe.write().mode(SaveMode.Append).parquet(getDSPath(dataset).toString());

When I use Append mode, the rows are added to the existing Dataset, which creates duplicate rows. If I use SaveMode.Overwrite, an exception is thrown:

File file:/share/data/applocation/spark/DATASETUAT/part-00000-3124c90f-461f-4c13-a5b2-25064de0ce59-c000.snappy.parquet does not exist

What can I do to overwrite an existing Dataset?

Answer 1

I resolved it. The trick is to write a temporary Parquet file to a different location, read it back into a new Dataset, and then overwrite the original Parquet file with that new Dataset.

At the end, it is necessary to remove the temporary Parquet file and clear the Dataset.
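The steps above could be sketched roughly as follows. This is only a minimal illustration, not the answerer's actual code: the class name OverwriteViaTemp, the method overwrite, and the targetPath and tempPath parameters are hypothetical, and it assumes the temporary location is on a file system reachable through the session's Hadoop configuration. The likely reason the detour helps is that Spark evaluates the Dataset lazily, so a direct Overwrite deletes the very Parquet files the Dataset still needs to read.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

// Hypothetical helper illustrating the temporary-file workaround.
public class OverwriteViaTemp {

    // updated:    the Dataset derived from the files at targetPath
    // targetPath: the original Parquet location to overwrite
    // tempPath:   a scratch location different from targetPath (assumption)
    public static void overwrite(SparkSession spark, Dataset<Row> updated,
                                 String targetPath, String tempPath) throws IOException {
        // 1. Materialize the updated rows in the temporary location.
        //    The original files are still intact while this runs.
        updated.write().mode(SaveMode.Overwrite).parquet(tempPath);

        // 2. Read the temporary copy into a new Dataset that no longer
        //    depends on the files at targetPath.
        Dataset<Row> staged = spark.read().parquet(tempPath);

        // 3. Overwrite the original location from the temporary copy.
        staged.write().mode(SaveMode.Overwrite).parquet(targetPath);

        // 4. Clean up: drop any cached data and delete the temporary files.
        updated.unpersist();
        Path tmp = new Path(tempPath);
        FileSystem fs = tmp.getFileSystem(spark.sparkContext().hadoopConfiguration());
        fs.delete(tmp, true); // true = recursive
    }
}
```

With the names from the question, a call might look like OverwriteViaTemp.overwrite(spark, dataframe, getDSPath(dataset).toString(), tempPath), where tempPath is whatever scratch directory the application can safely use.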
