
How to write data to Delta Lake from Kubernetes

Our organisation runs Databricks on Azure, which is used by data scientists & analysts primarily for notebooks to do ad-hoc analysis and exploration.

We also run Kubernetes clusters for ETL workflows that do not require Spark.

We would like to use Delta Lake as our storage layer, where both Databricks and Kubernetes are able to read and write as first-class citizens.
Currently our Kubernetes jobs write Parquet files directly to blob store, with an additional job that spins up a Databricks cluster to load the Parquet data into Databricks' table format. This is slow and expensive.

What I would like to do is write to Delta Lake from Kubernetes Python directly, as opposed to first dumping a Parquet file to blob store and then triggering an additional Databricks job to load it into Delta Lake format.
Conversely, I'd also like to leverage Delta Lake to query from Kubernetes.


In short, how do I set up my Kubernetes Python environment such that it has equal access to the existing Databricks Delta Lake for writes & queries?
Code would be appreciated.

You can usually write into the Delta table using the Delta connector for Spark. Just start a Spark job with the necessary packages and configuration options:

spark-submit --packages io.delta:delta-core_2.12:1.0.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
...

and write the same way as on Databricks:

df.write.format("delta").mode("append").save("some_location")

But by using the OSS version of Delta you may lose some of the optimizations that are available only on Databricks, like Data Skipping, etc. In that case, performance for the data written from Kubernetes could be lower (it really depends on how you access the data).

There could be a case when you can't write into a Delta table created by Databricks: when the table was written by a writer whose writer version is higher than that supported by the OSS Delta connector (see the Delta Protocol documentation). For example, this happens when you enable Change Data Feed on the Delta table, which performs additional actions when writing data.
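
If such a write fails with a protocol error, one way to see what the table requires is to inspect its protocol versions, for example with DESCRIBE DETAIL (a sketch reusing the spark session from above; the table path is a placeholder):

# minReaderVersion / minWriterVersion tell you which protocol version the
# table requires; compare minWriterVersion with what your OSS Delta supports.
detail = spark.sql(
    "DESCRIBE DETAIL delta.`abfss://<container>@<storage-account>.dfs.core.windows.net/tables/events`"
)
detail.select("minReaderVersion", "minWriterVersion").show()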

Outside of Spark, there are plans to implement a so-called Standalone writer for JVM-based languages (in addition to the existing Standalone reader). And there is the delta-rs project, implemented in Rust (with bindings for Python & Ruby), that should be able to write into a Delta table (but I haven't tested that myself).
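
As a sketch of what the delta-rs route could look like from plain Python (no Spark at all), assuming the deltalake package from PyPI and the same placeholder storage account; as noted above, I haven't verified this myself against a Databricks-written table:

import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Append a pandas DataFrame to a Delta table without a Spark cluster.
# The URI and credentials are placeholders; deltalake passes storage_options
# through to the underlying Azure object store client.
uri = "abfss://<container>@<storage-account>.dfs.core.windows.net/tables/events"
storage_options = {"account_name": "<storage-account>", "account_key": "<account-key>"}

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
write_deltalake(uri, df, mode="append", storage_options=storage_options)

# Read it back into pandas for queries from Kubernetes
dt = DeltaTable(uri, storage_options=storage_options)
print(dt.to_pandas())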

Update 14.04.2022: Data Skipping is also available in OSS Delta, starting with version 1.2.0.
