
Difference between Delta Lake and Lake Database in Azure Synapse

I'm building a lakehouse architecture in Azure Synapse and am unsure whether to use Delta Lake or a lake database.

Both seem to offer roughly the same functionality: I can use Spark for ETL tasks, and then query the data with Spark pools as well as serverless SQL pools.

In the Azure documentation, a lake database is defined as:

"A lake database provides a relational metadata layer over one or more files in a data lake. You can create a lake database that includes definitions for tables, including column names and data types as well as relationships between primary and foreign key columns. The tables reference files in the data lake, enabling you to apply relational semantics to working with the data and querying it using SQL. However, the storage of the data files is decoupled from the database schema; enabling more flexibility than a relational database system typically offers."

Whereas Delta Lake is defined as:

Delta Lake is an open-source storage layer that adds relational database semantics to Spark-based data lake processing. Delta Lake is supported in Azure Synapse Analytics Spark pools for PySpark, Scala, and .NET code.

The benefits of using Delta Lake in a Synapse Analytics Spark pool include:

Relational tables that support querying and data modification. With Delta Lake, you can store data in tables that support CRUD (create, read, update, and delete) operations. In other words, you can select, insert, update, and delete rows of data in the same way you would in a relational database system.
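For a concrete picture, here is a minimal sketch of those CRUD operations against a Delta table from a Synapse Spark pool notebook (PySpark). The storage path, column names, and values are placeholders for illustration, and `spark` is the session a Synapse notebook already provides:

```python
from delta.tables import DeltaTable

# Placeholder path in an ADLS Gen2 container (adjust to your workspace).
delta_path = "abfss://files@mydatalake.dfs.core.windows.net/delta/products"

# Create: write a DataFrame out in Delta format.
df = spark.createDataFrame(
    [(1, "Widget", 2.99), (2, "Gadget", 9.99)],
    ["ProductID", "ProductName", "Price"],
)
df.write.format("delta").mode("overwrite").save(delta_path)

# Read: load the Delta table back.
spark.read.format("delta").load(delta_path).show()

# Update and delete: row-level changes via the DeltaTable API.
delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.update(condition="ProductID = 1", set={"Price": "3.49"})
delta_table.delete("ProductID = 2")
```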

What are the differences between Delta Lake and a Lake Database (if any) in Azure Synapse? Or are they simply two different tools to achieve roughly the same results? Are there concrete benefits of using one over the other?

The Lake Database is a facility that Microsoft added to Synapse Analytics that uses Spark SQL (Hive) managed tables to provide a database abstraction layer over your Parquet, CSV, or Delta tables. It uses the Hive Metastore, which keeps track of the database contents: tables, schemas, views, etc. If you use Delta tables in it, you still get all the additional metadata that is part of Delta Lake's change tracking, but that Delta table metadata is not part of the Lake Database metastore. I am using the free, open-source Linux Foundation distribution of Delta Lake.
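To illustrate, here is a minimal sketch of how a managed Delta table ends up registered in that metastore from a Synapse Spark pool notebook; the database and table names are placeholders I chose, and `spark` is the notebook's built-in session:

```python
# Create a Spark database; it is registered in the Hive Metastore and
# surfaces in Synapse Studio under the lake databases.
spark.sql("CREATE DATABASE IF NOT EXISTS sales_lake_db")

# Create a managed table in Delta format; the schema and location are
# tracked in the Hive Metastore, while Delta keeps its own transaction
# log (_delta_log) alongside the data files.
df = spark.createDataFrame(
    [(1, "2024-01-15", 120.50), (2, "2024-01-16", 75.00)],
    ["OrderID", "OrderDate", "Amount"],
)
(df.write
   .format("delta")
   .mode("overwrite")
   .saveAsTable("sales_lake_db.orders"))

# The table can now be queried by name rather than only by file path.
spark.sql("SELECT * FROM sales_lake_db.orders").show()
```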

If you configure your Delta Lake properly, you can get it to appear in Synapse Studio as a Lake Database. One advantage of the Lake Database is that in Synapse data flows you can use the Workspace DB source type instead of an Integration Dataset; it is designed for Lake Databases and works with the database-and-table model, so you don't have to define a pile of integration datasets.

I am in the process of setting this up for a client and still discovering the details. Documentation is plentiful for the individual pieces, but nothing exists for the whole: how to configure it and how it all works together. So please excuse any inaccurate statements here. There are many nuances to know in order to integrate the open-source Delta Lake into the Lake Database and Synapse pipelines. What you get with this stack should be similar to what you get in the Databricks version of Delta Lake, except here the configuration is all on you and you need some luck figuring it out.
