
spark delta overwrite a specific partition

So I have a dataframe which has a column, file_date. For a given run, the dataframe has data for only one unique file_date. For instance, in a run, let us assume that there are about 100 records with a file_date of 2020_01_21.

I am writing this data using the following:

(df
 .repartition(1)
 .write
 .format("delta")
 .partitionBy("FILE_DATE")
 .mode("overwrite")
 .option("overwriteSchema", "true")
 .option("replaceWhere","FILE_DATE=" + run_for_file_date)
 .save("/mnt/starsdetails/starsGrantedDetails/"))

My requirement is to create a folder/partition for every FILE_DATE, as there is a good chance that data for a specific file_date will be rerun, and that file_date's data has to be overwritten. Unfortunately, in the above code, if I don't include the "replaceWhere" option, it overwrites the data for the other partitions too. If I do write the above, the data seems to overwrite the specific partition correctly, but every time the write is done I get the following error.

Please note I have also set the following spark config before the write:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")

But I am still getting the following error:

AnalysisException: "Data written out does not match replaceWhere 'FILE_DATE=2020-01-19'.\nInvalid data would be written to partitions FILE_DATE=2020-01-20.;"

Can you kindly help, please?

There are a couple of things to keep in mind while using replaceWhere to overwrite a delta partition. Your dataframe must be filtered before writing into the partitions. For example, we have dataframe DF:

[image: sample rows of dataframe DF, which has a date partition column]

When we write this dataframe into the delta table, the dataframe's partition column range must be filtered, which means we should only have partition column values within our replaceWhere condition range.

 DF.write.format("delta").mode("overwrite").option("replaceWhere",  "date >= '2020-12-14' AND date <= '2020-12-15' ").save( "Your location")

If we use the condition date < '2020-12-15' instead of date <= '2020-12-15', it will give us an error:

[image: AnalysisException reporting that the data written out does not match the replaceWhere condition]

The other thing is that the partition column value needs to be in quotation marks, e.g. '2020-12-15'; otherwise chances are it will give an error.
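For instance, building the condition string in Python so that the value ends up quoted (run_for_file_date is the variable from the question; the value here is only illustrative):

 run_for_file_date = "2020-01-19"                          # illustrative value for the partition being rerun
 condition = "FILE_DATE = '{}'".format(run_for_file_date)  # the single quotes around the value matter
 # condition is now "FILE_DATE = '2020-01-19'", whereas "FILE_DATE=" + run_for_file_date
 # would produce FILE_DATE=2020-01-19 without quotes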

There is also an open pull request for delta overwrite partition support via spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic") here: https://github.com/delta-io/delta/pull/371. I am not sure if they are planning to introduce it.
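If a later Delta Lake release does add dynamic partition overwrite, the usage could look roughly like the sketch below. The writer option name is an assumption and is not confirmed by the linked pull request, so check the release notes of your Delta version before relying on it:

 (df
  .write
  .format("delta")
  .partitionBy("FILE_DATE")
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")   # assumed option name; only the partitions present in df would be replaced
  .save("/mnt/starsdetails/starsGrantedDetails/"))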

I faced the same issue and realized my dataframe had more values for the column I was trying to replace the partition with. You must filter your dataframe first, prepare your replaceWhere, and then write.
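Applied to the question, that means filtering to the single file_date being rerun and deriving the replaceWhere condition from the same value, so every row written satisfies it. A minimal PySpark sketch, reusing the path and column name from the question (the date value is illustrative):

 from pyspark.sql import functions as F

 run_for_file_date = "2020-01-19"                               # the partition being rerun

 (df
  .filter(F.col("FILE_DATE") == run_for_file_date)              # keep only rows for that partition
  .repartition(1)
  .write
  .format("delta")
  .partitionBy("FILE_DATE")
  .mode("overwrite")
  .option("replaceWhere", "FILE_DATE = '{}'".format(run_for_file_date))
  .save("/mnt/starsdetails/starsGrantedDetails/"))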
