
Manually deleted data file from Delta Lake

I have manually deleted a data file from Delta Lake, and now the command below raises an error:

mydf = spark.read.format('delta').load('/mnt/path/data')
display(mydf)

Error:

A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.microsoft.com/azure/databricks/delta/delta-intro#frequently-asked-questions

I have tried restarting the cluster with no luck, and also tried the settings below:

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.databricks.io.cache.enabled", "false")

Any help with repairing the transaction log or fixing the error would be appreciated.

The above error indicates that you have manually deleted a data file without using the proper DELETE statement.

As per the MS docs, you can try the VACUUM command; using VACUUM fixes the error.

%sql
VACUUM 'Your_path'
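One caveat: VACUUM only deletes files older than the retention threshold (7 days by default), so on its own it may not touch recently orphaned files. A sketch of forcing an immediate clean-up, assuming you are certain no readers or time-travel queries need the older versions (the retention safety check must be disabled first):

```sql
%sql
-- Assumption: no concurrent readers depend on older table versions
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM 'Your_path' RETAIN 0 HOURS;
```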

For more information, refer to this link.

As explained before, you must use VACUUM to remove files: manually deleting files does not update the Delta transaction log, which is what Spark uses to identify which files to read.

In your case you can also use the FSCK REPAIR TABLE command. As per the docs: "Removes the file entries from the transaction log of a Delta table that can no longer be found in the underlying file system. This can happen when these files have been manually deleted."
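For a path-based table like the one in the question, this can be run from a notebook cell; the DRY RUN variant previews which log entries would be removed before anything is changed:

```sql
%sql
-- Preview the missing-file entries that would be dropped from the log
FSCK REPAIR TABLE delta.`/mnt/path/data` DRY RUN;

-- Actually remove them so the table can be read again
FSCK REPAIR TABLE delta.`/mnt/path/data`;
```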

The FSCK command worked for me. Thanks, all.

