
How to write data to Delta Lake from Kubernetes

Our organisation runs Databricks on Azure, which is used by data scientists & analysts primarily for notebooks to do ad-hoc analysis and exploration.

We also run Kubernetes clusters for ETL workflows that do not require Spark.

We would like to use Delta Lake as our storage layer, where both Databricks and Kubernetes are able to read and write as first-class citizens.
Currently our Kubernetes jobs write Parquet files directly to blob store, with an additional job that spins up a Databricks cluster to load the Parquet data into Databricks' table format. This is slow and expensive.

What I would like to do is write to Delta Lake from Kubernetes Python directly, as opposed to first dumping a Parquet file to blob store and then triggering an additional Databricks job to load it into Delta Lake format.
Conversely, I'd also like to leverage Delta Lake to query from Kubernetes.


In short, how do I set up my Kubernetes Python environment such that it has equal access to the existing Databricks Delta Lake for writes & queries?
Code would be appreciated.

You can usually write into the Delta table using the Delta connector for Spark. Just start a Spark job with the necessary packages and configuration options:

spark-submit --packages io.delta:delta-core_2.12:1.0.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
...

and write the same way as on Databricks:

df.write.format("delta").mode("append").save("some_location")

But by using the OSS version of Delta you may lose some of the optimizations that are available only on Databricks, like Data Skipping, etc. In that case, performance for the data written from Kubernetes could be lower (it really depends on how you access the data).

There could be a case when you can't write into a Delta table created by Databricks: when the table was written by a writer whose writer version is higher than that supported by the OSS Delta connector (see the Delta Protocol documentation). For example, this happens when you enable Change Data Feed on the Delta table, which performs additional actions when writing data.
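
If such a write fails with a protocol error, one way to see what the table requires is to inspect its protocol versions, for example with DESCRIBE DETAIL (a sketch reusing the spark session from above; the table path is a placeholder):

# minReaderVersion / minWriterVersion tell you which protocol version the
# table requires; compare minWriterVersion with what your OSS Delta supports.
detail = spark.sql(
    "DESCRIBE DETAIL delta.`abfss://<container>@<storage-account>.dfs.core.windows.net/tables/events`"
)
detail.select("minReaderVersion", "minWriterVersion").show()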

Outside of Spark, there are plans to implement a so-called Standalone writer for JVM-based languages (in addition to the existing Standalone reader). And there is the delta-rs project, implemented in Rust (with bindings for Python & Ruby), that should be able to write into a Delta table (but I haven't tested that myself).
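
As a sketch of what the delta-rs route could look like from plain Python (no Spark at all), assuming the deltalake package from PyPI and the same placeholder storage account; as noted above, I haven't verified this myself against a Databricks-written table:

import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Append a pandas DataFrame to a Delta table without a Spark cluster.
# The URI and credentials are placeholders; deltalake passes storage_options
# through to the underlying Azure object store client.
uri = "abfss://<container>@<storage-account>.dfs.core.windows.net/tables/events"
storage_options = {"account_name": "<storage-account>", "account_key": "<account-key>"}

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
write_deltalake(uri, df, mode="append", storage_options=storage_options)

# Read it back into pandas for queries from Kubernetes
dt = DeltaTable(uri, storage_options=storage_options)
print(dt.to_pandas())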

Update 14.04.2022: Data Skipping is also available in OSS Delta, starting with version 1.2.0.
