
Spark: refresh Delta Table in S3

How can I run the refresh table command on a Delta table in S3? When I do

from delta.tables import DeltaTable

deltatable = DeltaTable.forPath(spark, "s3a://test-bucket/delta_table/")
spark.catalog.refreshTable(deltatable)

I am getting the error:

AttributeError: 'DeltaTable' object has no attribute '_get_object_id'

Does the refresh command only work for Hive tables?

Thanks!

OK, that's really the wrong function - spark.catalog.refreshTable (doc) refreshes the metadata of a table registered in Spark's catalog, and it expects the table name as a string, not a DeltaTable object (which is why you get that AttributeError). It has nothing to do with recovering the data of a Delta table.
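A minimal sketch of how refreshTable is normally used, assuming a hypothetical registered table name my_db.delta_table, plus how a path-based Delta table is simply re-read to pick up its latest state:

# refreshTable expects the name of a table registered in the catalog/metastore,
# not a DeltaTable object ("my_db.delta_table" is a placeholder name)
spark.catalog.refreshTable("my_db.delta_table")

# A path-based Delta table has no catalog entry to refresh; re-reading the path
# always returns the latest snapshot of the table
df = spark.read.format("delta").load("s3a://test-bucket/delta_table/")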

To fix this on the Delta side you need to do something different. Unfortunately I'm not 100% sure about the right way for the open source Delta implementation - on Databricks we have the FSCK REPAIR TABLE SQL command for that. I would try the following (be careful, make a backup first!); rough sketches of the options are shown after the list:

  • If the removed files belonged only to a recent version, you may try the RESTORE command with spark.sql.files.ignoreMissingFiles set to true to roll the table back to an earlier version.

  • If the removed files were in a specific partition, you can read the table (again with spark.sql.files.ignoreMissingFiles set to true), keep only the data for that partition, and write it back in overwrite mode with the replaceWhere option (doc) containing the matching partition condition.

  • Or you can read the whole Delta table (again with spark.sql.files.ignoreMissingFiles set to true) and write it back in overwrite mode - it will of course temporarily duplicate your data, but the old files will eventually be removed by VACUUM.
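A rough sketch of what these options could look like in PySpark. In practice you would pick only one of them; the version number, partition column, and predicate are placeholders, and DeltaTable.restoreToVersion is only available in newer Delta Lake releases (older ones expose it only via the RESTORE SQL command):

from delta.tables import DeltaTable

path = "s3a://test-bucket/delta_table/"

# Let Spark skip data files that no longer exist in S3
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

# Option 1: roll the table back to an earlier version
# (version 5 is a placeholder - pick a version that no longer references the lost files)
deltatable = DeltaTable.forPath(spark, path)
deltatable.restoreToVersion(5)

# Option 2: rewrite only the affected partition with replaceWhere
# ("event_date" and its value are hypothetical partition details)
df = spark.read.format("delta").load(path)
(df.filter("event_date = '2024-01-01'")
   .write.format("delta")
   .mode("overwrite")
   .option("replaceWhere", "event_date = '2024-01-01'")
   .save(path))

# Option 3: read the whole table and write it back in overwrite mode;
# the obsolete files are physically deleted later by VACUUM
df = spark.read.format("delta").load(path)
df.write.format("delta").mode("overwrite").save(path)
deltatable.vacuum()  # uses the default retention period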
