
Using Delta Tables in Azure Synapse Dedicated/Serverless SQL Pools

I am currently employed as a Junior Data Developer and recently saw a post saying that Azure Synapse can now create SQL tables from Delta tables. I tried creating a SQL table from a Delta table stored in Azure Data Lake Storage Gen2, but when I use 'PARQUET' as the file format with a wildcard to read the files, the table is populated with redundant data (the data from every snapshot in the folder).

I tried creating an external file format for my table, but Synapse doesn't accept 'DELTA' as a format type. I used 'PARQUET' as the file format and ran VACUUM on my Delta table to keep only its latest snapshot. Whenever I pointed the path at a specific file, or once there was only a single snappy.parquet file left in the Delta table folder, the data was returned correctly.

Basically, is there any way to create a Synapse table/external table that gets its data from a Delta table? If not, is there any way to stop Delta Lake from creating a new snapshot every time data is written/updated/deleted?

Script used:

IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'SynapseParquetFormat') 
CREATE EXTERNAL FILE FORMAT [SynapseParquetFormat] 
WITH ( FORMAT_TYPE = PARQUET)
GO

IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'ExtSource') 
CREATE EXTERNAL DATA SOURCE [ExtSource] 
WITH (
    LOCATION   = '*'
)
GO

CREATE EXTERNAL TABLE dbo.ext_table (
    [CostCentre] varchar(8000),
    [CostCentre_MemberId] int
)
WITH (
    LOCATION = 'dimensions/Dim_Example/*.snappy.parquet',
    -- WILDCARD IF THERE IS ONLY ONE FILE OR LATEST FILE NEEDS TO BE SPECIFIED
    DATA_SOURCE = [ExtSource],
    FILE_FORMAT = [SynapseParquetFormat]
)
GO

/* '*' used as a placeholder because the clients' real data paths cannot be shared */

Querying the Delta Lake format in a serverless Synapse SQL pool is currently in public preview.

Note: This preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities.

This feature is in public preview and there are some known issues and limitations. Review the known issues on the Synapse serverless SQL pool self-help page.

For more details, refer to Query Delta Lake files (preview) using serverless SQL pool in Azure Synapse Analytics.
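For illustration, here is a minimal sketch of such a serverless query. The storage account, container, and folder names are placeholders for your own paths; the engine reads the _delta_log itself and returns only the current snapshot, so no VACUUM or wildcard tricks are needed:

-- Serverless SQL pool: query the Delta folder directly (paths are placeholders)
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://youraccount.dfs.core.windows.net/yourcontainer/dimensions/Dim_Example/',
    FORMAT = 'DELTA'
) AS [result];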

Delta is not supported in Synapse. You can vote for this item here: https://feedback.azure.com/forums/307516-azure-synapse-analytics/suggestions/41120794-delta-lake-support-to-sql-on-demand

In the serverless pool there are some workarounds:

You can use this PowerShell script to generate a Delta Lake view based on the current data: https://github.com/JocaPC/tsql-delta-lake-view/blob/main/GenDeltaLakeView.ps1

Note that you will need to re-generate the view if the data changes.
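To give a feel for what such a script produces, a hypothetical generated view might look roughly like the following; the file names and storage paths below are made up, and the real script derives the current file list from the Delta log:

-- Hypothetical view over the Parquet files of the current Delta snapshot
CREATE VIEW dbo.vw_Dim_Example AS
SELECT *
FROM OPENROWSET(
    BULK (
        'https://youraccount.dfs.core.windows.net/yourcontainer/dimensions/Dim_Example/part-00000-aaaa.snappy.parquet',
        'https://youraccount.dfs.core.windows.net/yourcontainer/dimensions/Dim_Example/part-00001-bbbb.snappy.parquet'
    ),
    FORMAT = 'PARQUET'
) AS [result];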

As an alternative, you can create a Delta view based on the manifest file: https://giangnt2.medium.com/directly-reading-delta-lake-data-from-azure-synapse-8b911dce2bcd

Again you will need to re-generate it if something changes.
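As a rough illustration of the general idea (not necessarily the exact approach in the linked article), assuming a symlink manifest has been generated for the Delta table (for example with Delta's GENERATE symlink_format_manifest command), you can peek at the list of files that make up the current snapshot from serverless SQL; the path below is a placeholder:

-- Read the manifest as a single-column text file to list the current data files
SELECT result.data_file_path
FROM OPENROWSET(
    BULK 'https://youraccount.dfs.core.windows.net/yourcontainer/dimensions/Dim_Example/_symlink_format_manifest/manifest',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b'
) WITH (data_file_path varchar(2000)) AS result;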

Currently, there is no DELTA format in the Azure Synapse Dedicated SQL Pool for external tables. You cannot create a table within a SQL Pool that can read the Delta format. Even though you can work around your problem with the PARQUET format and VACUUM, as you mentioned, it's not a recommended solution for everyday data operations. From the Databricks documentation:

"We do not recommend that you set a retention interval shorter than 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If vacuum cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when vacuum deletes files that have not yet been committed." https://docs.databricks.com/delta/delta-utility.html

And to answer your second question, I don't know any solution to stop creating snapshots. One of the key features of Delta Lake is to provide these snapshots.

A suggestion from my side is to use either Data Factory or a Spark application for data movement, reading from Delta-tables and writing to a table in Dedicated SQL Pool.

  1. With Data Factory you have a built-in connector for Delta tables, but you'll need a Databricks cluster to connect and read the data with Data Factory. Use either a Copy Activity or a Mapping Data Flow to read from Delta and write to a SQL Pool. Alternatively, read from Delta, write to Parquet, and create an external table in the SQL Pool (see the sketch at the end of this answer).

  2. The second option is to use a Spark application. Read data from the Delta table into a Spark DataFrame and write it to the SQL Pool.

Hosting of the Spark application could be done in either Azure Databricks or Spark-pools for Azure Synapse Analytics. For Azure Databricks there's an Azure Synapse Connector, more info here: https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html#language-python
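If you go with the export-to-Parquet route from option 1, the external table over the exported folder could look roughly like this, reusing the data source and file format from the question; the export folder name is an assumption, and the folder must contain only plain Parquet files (no _delta_log, no old snapshots):

CREATE EXTERNAL TABLE dbo.ext_Dim_Example (
    [CostCentre] varchar(8000),
    [CostCentre_MemberId] int
)
WITH (
    LOCATION = 'dimensions/Dim_Example_export/',
    DATA_SOURCE = [ExtSource],
    FILE_FORMAT = [SynapseParquetFormat]
)
GO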

The latest version of Dedicated SQL pool supports the DELTA format. Please refer to the links below for creating them:

CREATE EXTERNAL DATA SOURCE --> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-ver16
CREATE EXTERNAL FILE FORMAT --> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql?view=sql-server-ver16
CREATE EXTERNAL TABLE --> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver16
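Based on those documents, a sketch might look like the following, assuming your pool accepts FORMAT_TYPE = DELTA as described there; note that LOCATION points at the Delta table's root folder rather than at individual Parquet files, and the data source is the one from the question:

CREATE EXTERNAL FILE FORMAT [DeltaFormat]
WITH ( FORMAT_TYPE = DELTA )
GO

CREATE EXTERNAL TABLE dbo.ext_table_delta (
    [CostCentre] varchar(8000),
    [CostCentre_MemberId] int
)
WITH (
    LOCATION = 'dimensions/Dim_Example',   -- root folder of the Delta table
    DATA_SOURCE = [ExtSource],
    FILE_FORMAT = [DeltaFormat]
)
GO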

Hope this helps you:)
