
Azure Synapse - Incremental Data Load

We load data from on-prem database servers to Azure Data Lake Storage Gen2 using Azure Data Factory and Databricks, and store it as parquet files. On each run, we pick up only the new and modified data since the last run and UPSERT it into the existing parquet files using a Databricks MERGE statement.
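For reference, the Databricks-side upsert looks roughly like the sketch below. This is a minimal sketch, assuming the files are registered as a Delta table (plain parquet files do not support MERGE, so the actual pipeline may differ); the table and view names are hypothetical.

-- Minimal sketch of the Databricks upsert step. Assumes the target is a
-- Delta table; lake.DimProduct and incoming_changes are hypothetical names.
MERGE INTO lake.DimProduct AS t
USING incoming_changes AS s
    ON t.ProductKey = s.ProductKey
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;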

Now we are trying to move this data from the parquet files into Azure Synapse. Ideally, I would like to:

  • Read the incremental load data into an external table (CETAS or COPY INTO; see the sketch after this list).
  • Use the above as a staging table.
  • Merge the staging table with the production table.
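For the first step, a COPY INTO load into a staging table could look roughly like this. A minimal sketch: the storage URL and the staging table definition are hypothetical, and authentication options are omitted.

-- Hypothetical staging table, distributed on the merge key
CREATE TABLE dbo.stg_DimProduct
(
    ProductKey          INT           NOT NULL,
    EnglishProductName  NVARCHAR(50),
    Color               NVARCHAR(15)
)
WITH (DISTRIBUTION = HASH(ProductKey), HEAP);

-- Load only the incremental parquet files produced by the last run
COPY INTO dbo.stg_DimProduct
FROM 'https://<storageaccount>.dfs.core.windows.net/<container>/incremental/DimProduct/*.parquet'
WITH (FILE_TYPE = 'PARQUET');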

The problem is that the MERGE statement is not available in Azure Synapse. Here is the solution Microsoft suggests for incremental load:

CREATE TABLE dbo.[DimProduct_upsert]
WITH
(   DISTRIBUTION = HASH([ProductKey])
,   CLUSTERED INDEX ([ProductKey])
)
AS
-- New rows and new versions of rows
SELECT      s.[ProductKey]
,           s.[EnglishProductName]
,           s.[Color]
FROM      dbo.[stg_DimProduct] AS s
UNION ALL  
-- Keep rows that are not being touched
SELECT      p.[ProductKey]
,           p.[EnglishProductName]
,           p.[Color]
FROM      dbo.[DimProduct] AS p
WHERE NOT EXISTS
(   SELECT  *
    FROM    [dbo].[stg_DimProduct] s
    WHERE   s.[ProductKey] = p.[ProductKey]
)
;

RENAME OBJECT dbo.[DimProduct]          TO [DimProduct_old];
RENAME OBJECT dbo.[DimProduct_upsert]  TO [DimProduct];

Basically, this drops and re-creates the production table with CTAS. That will work fine for small dimension tables, but I'm apprehensive about large fact tables with hundreds of millions of rows and indexes. Any suggestions on the best way to do incremental loads for really large fact tables? Thanks!

Until SQL MERGE is officially supported, the recommended way to update target tables is to use T-SQL INSERT/UPDATE commands between the delta records and the target table.
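A minimal sketch of that pattern, reusing the staging and production tables from the question's example (the column list is illustrative). Note that the dedicated SQL pool does not support ANSI joins in UPDATE statements, so the update uses the FROM/WHERE form:

-- Update existing rows from the staged delta records
UPDATE dbo.DimProduct
SET    EnglishProductName = s.EnglishProductName,
       Color              = s.Color
FROM   dbo.stg_DimProduct AS s
WHERE  DimProduct.ProductKey = s.ProductKey;

-- Insert rows that are not yet in the production table
INSERT INTO dbo.DimProduct (ProductKey, EnglishProductName, Color)
SELECT s.ProductKey, s.EnglishProductName, s.Color
FROM   dbo.stg_DimProduct AS s
WHERE  NOT EXISTS
(
    SELECT 1
    FROM   dbo.DimProduct AS p
    WHERE  p.ProductKey = s.ProductKey
);

Wrapping the two statements in a transaction keeps the target consistent if the load fails between them.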

Alternatively, you can also use Mapping Data Flows (in ADF) to emulate SCD transactions for dimensional/fact data loads.

