
Can we delete the latest version of a Delta table in Delta Lake?

I have a Delta table with 4 versions.

DESCRIBE HISTORY cfm shows 4 versions: 0, 1, 2, 3.

I want to delete version 3 or 2. How can I achieve this?

I tried:

from delta.tables import *
from pyspark.sql.functions import *

deltaTable = DeltaTable.forPath(spark, "path of cfm files")

deltaTable.delete("'version' = '3'") 

This does not delete version 3. https://docs.delta.io/0.4.0/delta-update.html says:

"delete removes the data from the latest version of the Delta table but does not remove it from the physical storage until the old versions are explicitly vacuumed"

If I have to run the vacuum command, how do I apply it to the latest dates rather than the older ones?

You need to use the vacuum command to perform this operation. However, the default retention period for vacuum is 7 days, and it will error out if you try to vacuum anything more recent than that.
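To make that safety check concrete, here is a hedged, plain-Python sketch of the logic (this is illustrative only, not Delta Lake's actual implementation; the function name and error message are invented): vacuum compares the requested retention against the 7-day default (168 hours) and refuses shorter values while the check is enabled.

```python
# Illustrative sketch of Delta's retention-duration safety check.
# Not the real implementation; names and messages are assumptions.

DEFAULT_RETENTION_HOURS = 168  # 7 days

def check_retention(requested_hours, check_enabled=True):
    """Raise if the requested vacuum retention is shorter than the
    default while the safety check is still enabled."""
    if check_enabled and requested_hours < DEFAULT_RETENTION_HOURS:
        raise ValueError(
            f"Retention of {requested_hours}h is below the default "
            f"{DEFAULT_RETENTION_HOURS}h; disable "
            "spark.databricks.delta.retentionDurationCheck.enabled "
            "to proceed."
        )
    return requested_hours

check_retention(200)                      # fine: above the 168 h default
check_retention(24, check_enabled=False)  # fine: check disabled
# check_retention(24) would raise ValueError
```

This is why the configuration flag below is needed before calling `vacuum(24)`.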

We can work around this by setting a Spark configuration that bypasses the default retention-period check.

Solution below:

from delta.tables import *

# Disable the default 7-day retention safety check so a shorter
# retention period can be used.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

deltaTable = DeltaTable.forPath(spark, deltaPath)
deltaTable.vacuum(24)  # remove unreferenced data files older than 24 hours

*deltaPath -- the path to your Delta table

*24 -- the retention period in hours: data files that are no longer referenced by the current table version and are older than 24 hours get deleted, so any version created more than 24 hours ago can no longer be restored or time-traveled to.
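To make the 24-hour cutoff concrete, here is a small hedged sketch in plain Python (no Spark needed) of how a vacuum-style retention window selects files; the file names and timestamps are invented for illustration and this is not Delta Lake's actual file-selection code.

```python
from datetime import datetime, timedelta

def files_to_vacuum(unreferenced_files, retention_hours, now):
    """Return the unreferenced data files whose modification time falls
    outside the retention window. `unreferenced_files` maps
    file name -> last modification time."""
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(name for name, mtime in unreferenced_files.items()
                  if mtime < cutoff)

# Hypothetical data files left behind by older versions of the table.
now = datetime(2024, 1, 10, 12, 0)
files = {
    "part-0001.parquet": now - timedelta(hours=72),  # old version: removed
    "part-0002.parquet": now - timedelta(hours=30),  # old version: removed
    "part-0003.parquet": now - timedelta(hours=2),   # within 24 h: kept
}
print(files_to_vacuum(files, retention_hours=24, now=now))
# -> ['part-0001.parquet', 'part-0002.parquet']
```

With `retention_hours=24`, only the two files older than a day are selected, mirroring how `deltaTable.vacuum(24)` keeps data still inside the retention window.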

Hope this answers your question.
