Convert Azure storage account into Databricks delta tables

I just linked an Azure storage account (Storage Gen2) with its underlying containers to my Databricks environment. Inside the storage account are two containers, each with some subdirectories. Inside the folders are .csv files.

I have connected an Azure service principal with Storage Blob Data Contributor access to the storage account inside Databricks, so I can read and write to the storage account.
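For reference, the connection is set up roughly like the following (a sketch only: the storage account name, secret scope/key names and tenant id are placeholders, not my real values):

# Placeholders -- substitute your own storage account, app registration and secret scope.
storage_account = "mystorageaccount"
client_id = dbutils.secrets.get(scope="my-scope", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="my-scope", key="sp-client-secret")
tenant_id = dbutils.secrets.get(scope="my-scope", key="tenant-id")

# Standard ABFS OAuth configuration for an Azure service principal.
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")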

I am trying to figure out the best way to convert the existing storage account into a delta lake (tables registered in the metastore + the files converted to parquet/delta tables).

What is the easiest way to do that?

My naive approach as a beginner might be (a rough code sketch follows the steps below):

  1. Read the folder using spark.read.format("csv").load("{container}@{storage}..../directory")

  2. Write to a new folder with a similar name (so if the folder is "directory", write it to "directory_parquet") using df.write.format("delta").save("{container}@{storage}.../directory_parquet")

And then I am not sure about the last steps. This would create a new folder with a new set of files, but it wouldn't be a table in Databricks that shows up in the hive metastore. I do get parquet files, though.

Alternatively I can use df.write.format("delta").saveAsTable("tablename"), but that doesn't create the table in the storage account; it creates it inside the Databricks file system, although it does show up in the hive metastore.

  3. Delete the existing data files if desired (or keep them duplicated)
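In code, steps 1 and 2 would look roughly like this (all names below are placeholders for my real container, storage account and folder):

# Rough sketch of steps 1-2: read the CSV folder and rewrite it as Delta next to it.
container = "rootcontainer"
storage_account = "mystorageaccount"
directory = "somefolder"

source_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/{directory}"
target_path = f"{source_path}_parquet"

df = (spark.read.format("csv")
      .option("header", "true")       # assuming the .csv files have headers
      .option("inferSchema", "true")
      .load(source_path))

df.write.format("delta").mode("overwrite").save(target_path)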

Preferably this can be done in a Databricks notebook using Python, or in Scala/SQL if necessary.

*As a possible solution, if the effort to do this is monumental: just converting to parquet and getting the table information for each subfolder into the hive metastore, in the format database=containerName, tableName=subdirectoryName, would be acceptable.

The folder structure is pretty flat at the moment, so it is only rootcontainer/Subfolders deep.
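Given that flat layout, the fallback could be scripted along these lines (a sketch only: container and storage_account are placeholder names, and using dbutils.fs.ls to list the subfolders is my assumption):

# Sketch of the fallback: one database per container, one external Delta table per subfolder.
container = "rootcontainer"
storage_account = "mystorageaccount"
base_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/"

spark.sql(f"CREATE DATABASE IF NOT EXISTS {container}")

for folder in dbutils.fs.ls(base_path):
    if not folder.isDir():
        continue
    table_name = folder.name.rstrip("/")      # assumes folder names are valid table names
    delta_path = f"{base_path}{table_name}_delta"

    df = spark.read.format("csv").option("header", "true").load(folder.path)
    df.write.format("delta").mode("overwrite").save(delta_path)

    # Register the Delta folder as an external table in the hive metastore.
    spark.sql(f"CREATE TABLE IF NOT EXISTS {container}.{table_name} USING DELTA LOCATION '{delta_path}'")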

Perhaps an external table is what you're looking for:

df.write.format("delta").option("path", "some/external/path").saveAsTable("tablename") 

This post has more info on external tables vs managed tables.
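Applied to your layout, that would be something like the following (a sketch; the abfss path and the table name are placeholders, not values from your environment):

# Placeholders -- substitute your own container, storage account and subfolder.
source_path = "abfss://rootcontainer@mystorageaccount.dfs.core.windows.net/directory"
delta_path = "abfss://rootcontainer@mystorageaccount.dfs.core.windows.net/directory_delta"

df = spark.read.format("csv").option("header", "true").load(source_path)

# Because an explicit path is given, this creates an *external* table: the data files
# stay in your storage account and only the table definition is added to the metastore.
df.write.format("delta").option("path", delta_path).saveAsTable("directory")

If you later drop the table, the files at that path are left in place, which is the main practical difference from a managed table.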
