

Delta Table optimize/vacuum

I have files being written by a Kubernetes job (running on-prem) into an ADLS Gen2 container in the form of a Delta table. (Spark on Kubernetes, which helps me write Delta tables on ADLS.)

A huge number of files (small + big) flow in every hour, and we want to optimize/vacuum the Delta table.

Is there an automatic way / setting with which we can auto optimize & vacuum the Delta table?

I've read this article on auto optimization, but it's still unclear if it can help me.

Thank you, Rahul Kishore

The linked article describes a feature of Delta on Databricks that tries to produce bigger files when writing data - this is different from the automatic execution of OPTIMIZE/VACUUM.

Even on Databricks, you need to run VACUUM explicitly - just create a small Spark job that will execute VACUUM on the selected table(s); follow the documentation for the correct syntax & settings. A sketch of such a job is shown below.
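A minimal sketch, assuming OSS Delta (the delta-spark pip package); the ADLS Gen2 path, app name, and retention window are hypothetical placeholders, not prescriptions:

```python
from delta import DeltaTable, configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Build a Spark session with the Delta Lake extensions enabled
builder = (
    SparkSession.builder.appName("delta-vacuum")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Hypothetical ADLS Gen2 path to the Delta table
table_path = "abfss://container@account.dfs.core.windows.net/tables/events"

# Delete files no longer referenced by the table and older than the
# retention window (168 hours = 7 days, the default minimum; going lower
# requires disabling spark.databricks.delta.retentionDurationCheck.enabled)
DeltaTable.forPath(spark, table_path).vacuum(retentionHours=168)
```

Schedule this job (e.g. as a Kubernetes CronJob, since the writes already run on Kubernetes) to keep the table cleaned up regularly.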

Please note that OPTIMIZE is available only on Databricks. If you're using OSS Delta, you can emulate it by reading all or part of the data, repartitioning it for an optimal file size, and writing it back in overwrite mode. (Be careful when you optimize only part of the data - use the replaceWhere option as shown in the documentation and in the second sketch below.)
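For example, a minimal sketch of that emulated compaction, reusing the Spark session and table path from above; the `date` partition column, the filter value, and the target partition count are hypothetical:

```python
# Compact a single date slice: read it, repartition into fewer/bigger
# files, and overwrite only that slice using replaceWhere
partition_filter = "date = '2021-06-01'"   # hypothetical partition column

df = (spark.read.format("delta")
      .load(table_path)
      .where(partition_filter))            # read only the slice being compacted

(df.repartition(4)                         # target a handful of bigger files
   .write.format("delta")
   .mode("overwrite")
   .option("replaceWhere", partition_filter)  # replace only the matching rows
   .save(table_path))
```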
