
How to create a blank "Delta" Lake table schema in Azure Data Lake Gen2 using Azure Synapse Serverless SQL Pool?

I have a file with data integrated from 2 different sources using an Azure Mapping Data Flow and loaded into an ADLS Gen2 data lake container/folder, for example: /staging/EDW/Current/products.parquet.

I now need to process this staging file with an Azure Mapping Data Flow and load it into its corresponding dimension table using the SCD Type 2 method to maintain history.

However, I want to try creating and processing this dimension table as a "Delta" table in Azure Data Lake using Azure Mapping Data Flow only. The complication is that SCD Type 2 requires a lookup against the target to check whether any records/rows already exist: if not, insert them all; if any records have changed, perform updates; and so on (say, during the first-time load).
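As an aside, the SCD Type 2 logic described above (insert new keys, expire and re-insert changed ones) can be sketched in plain Python. This is purely illustrative; the column names and the single-attribute change check are assumptions, not the actual Mapping Data Flow implementation:

```python
from datetime import date

def scd2_apply(dim_rows, incoming, today=None):
    """Apply SCD Type 2 to an in-memory dimension.

    dim_rows: list of dicts with product_id, product_name, valid_from,
    valid_to, is_active ('Y'/'N'). incoming: list of dicts with
    product_id and product_name. Returns the updated dimension rows.
    """
    today = today or date.today()
    # Index the currently active version of each business key.
    active = {r["product_id"]: r for r in dim_rows if r["is_active"] == "Y"}
    out = list(dim_rows)
    for src in incoming:
        cur = active.get(src["product_id"])
        if cur is None:
            # New key: insert (a first-time load inserts everything).
            out.append({**src, "valid_from": today, "valid_to": None, "is_active": "Y"})
        elif cur["product_name"] != src["product_name"]:
            # Changed: expire the current version and insert a new one.
            cur["valid_to"] = today
            cur["is_active"] = "N"
            out.append({**src, "valid_from": today, "valid_to": None, "is_active": "Y"})
        # Unchanged rows are left untouched to preserve history.
    return out
```

On a first-time load `active` is empty, so every incoming row takes the insert branch; that is exactly why the target table must at least exist (even with zero rows) before the lookup can run.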

For that, I first need to create a default/blank "Delta" table in an Azure Data Lake folder, for example /curated/Delta/Dimensions/Products/. This is just like what we would have done in Azure SQL DW (Dedicated Pool), where we could first create a blank dbo.dim_products table with just the schema/structure and no rows.

I am trying to implement a data lakehouse architecture by utilizing and evaluating the best features of both Delta Lake and the Azure Synapse Serverless SQL pool with Azure Mapping Data Flow, for performance, cost savings, ease of development (low code) and understanding. At the same time, I want to avoid a Logical Data Warehouse (LDW) style of implementation for now.

For this, I tried creating a new database under the built-in Azure Synapse Serverless SQL pool, defining a data source and file format, and creating a blank Delta table/schema structure (without any rows), but with no luck:

create database delta_dwh;

create external data source deltalakestorage
with ( location = 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/' );

create external file format deltalakeformat
with ( format_type = delta );

drop external table dbo.products;
create external table dbo.products
(
    product_skey int,
    product_id int,
    product_name nvarchar(max),
    product_category nvarchar(max),
    product_price decimal(38, 18),
    valid_from date,
    valid_to date,
    is_active char(1)
)
with
(
    -- location is relative to the external data source root
    location = 'Dimensions/Products',
    data_source = deltalakestorage,
    file_format = deltalakeformat
);

However, this fails because a Delta table requires a _delta_log/*.json folder/file to be present, which maintains the transaction log. That means I would first have to write a few (dummy) rows in Delta format to the target folder; only then could I read it and run queries like the following, used in the SCD Type 2 implementation:

select isnull(max(product_skey), 0)
from openrowset(
    bulk 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products/',
    format = 'delta'
) as [rows]

(Note: with FORMAT = 'DELTA', the BULK path must point at the root folder of the Delta table, not at *.parquet files.)
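For context, the _delta_log the error complains about is just a folder of newline-delimited JSON commit files. The following stdlib-only sketch shows what the very first commit of an empty table contains (a protocol action plus a metaData action with the schema). It is illustrative of the on-disk format, not a recommended way to create tables; the path and field list are assumptions:

```python
import json
import os
import uuid

def create_empty_delta_table(table_path: str, fields: list) -> str:
    """Write the first Delta commit (protocol + metaData) for a zero-row table.

    fields is a list of (name, spark_type) pairs, e.g. ("product_id", "integer").
    Engines such as Spark normally create this file for you.
    """
    log_dir = os.path.join(table_path, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)

    # Spark-style schema JSON, stored as a string inside the metaData action.
    schema_string = json.dumps({
        "type": "struct",
        "fields": [{"name": n, "type": t, "nullable": True, "metadata": {}}
                   for n, t in fields],
    })
    actions = [
        {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
        {"metaData": {
            "id": str(uuid.uuid4()),
            "format": {"provider": "parquet", "options": {}},
            "schemaString": schema_string,
            "partitionColumns": [],
            "configuration": {},
        }},
    ]
    # Commit 0 is a newline-delimited JSON file named with a 20-digit version.
    commit = os.path.join(log_dir, f"{0:020d}.json")
    with open(commit, "w") as f:
        f.write("\n".join(json.dumps(a) for a in actions) + "\n")
    return commit
```

Because the log contains no add-file actions, a Delta reader that picks it up sees the declared schema and zero rows, which is exactly the "blank table" the question asks for.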

Any thoughts, inputs or suggestions?

Thanks!

You could try creating an initial/dummy data flow + pipeline to create the empty Delta files.

It's only a simple workaround:

  1. Create a CSV with your sample table data.
  2. Create a data flow named initDelta.
  3. Use this CSV as the source in the data flow.
  4. In the projection panel, set up the correct data types.
  5. Add a filter after the source with a dummy condition such as 1=2.
  6. Add a sink with Delta output.
  7. Put your initDelta data flow into a dummy pipeline and run it.
  8. The folder structure for Delta should then be created.

You mentioned that your initial data is in a Parquet file. You can use that file instead of a CSV: the table schema (columns and data types) will be imported from it. Filter out all rows and save the result as Delta.

I think this should work, unless I have missed something in your problem.

I don't think you can use a Serverless SQL pool to create a Delta table... yet. I think it is coming soon, though.
