
Multi Step Incremental load and processing using Azure Data Factory

I want to achieve an incremental load/processing with Azure Data Factory, storing the data in different places after each processing step, e.g.:

External data source (structured data) -> ADLS (Raw) -> ADLS (Processed) -> SQL DB

Hence, I will need to extract a sample of the raw data from the source based on the current date, store it in an ADLS container, then process that same sample data, store the result in another ADLS container, and finally append the processed result to a SQL DB.

ADLS raw:

2022-03-01.txt

2022-03-02.txt

ADLS processed:

2022-03-01-processed.txt

2022-03-02-processed.txt

SQL DB:

All the txt files in the ADLS processed container will be appended and stored in the SQL DB.

Hence, I would like to know the best way to achieve this in a single pipeline that has to be run in batches.
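(For reference, a date-based file name such as 2022-03-01.txt can be generated in ADF with a dynamic expression on the sink file name. This is a minimal sketch, assuming the sink dataset's file name field is set with dynamic content rather than a fixed value; the property layout may differ depending on the dataset type.)

```json
{
  "fileName": {
    "value": "@concat(formatDateTime(utcnow(), 'yyyy-MM-dd'), '.txt')",
    "type": "Expression"
  }
}
```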

You can achieve this using a dynamic pipeline as follows:

  1. Create a Config / Metadata table in SQL DB in which you place details such as the source table name, source name, etc.

  2. Create a pipeline as follows:

    a) Add a Lookup activity in which you create a query based on your Config table: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity

    b) Add a ForEach activity and use the Lookup output as the input to the ForEach: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity

    c) Inside the ForEach you can add a Switch activity, where each Switch case distinguishes a table or source

    d) In each case, add a Copy activity (or whatever other activities you need) to create the file in the RAW layer

    e) Add another ForEach in your pipeline for the Processed layer, with inner activities similar to those used for the RAW layer; in this ForEach you can add your processing logic

This way you can create a single, dynamic pipeline that performs the necessary operations for all sources. A minimal JSON sketch of the Lookup / ForEach / Switch wiring is shown below.
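As an illustration only, the wiring could look roughly like the following pipeline fragment. The activity names, the dbo.PipelineConfig table, the ConfigDataset reference, and the AzureSqlSource query are assumptions used for the sketch, not part of the original answer; the per-source Copy and other activities would go inside the empty cases arrays.

```json
{
  "activities": [
    {
      "name": "LookupConfig",
      "type": "Lookup",
      "description": "Read one row per source from the config / metadata table",
      "typeProperties": {
        "source": {
          "type": "AzureSqlSource",
          "sqlReaderQuery": "SELECT SourceName, SourceTableName FROM dbo.PipelineConfig"
        },
        "dataset": { "referenceName": "ConfigDataset", "type": "DatasetReference" },
        "firstRowOnly": false
      }
    },
    {
      "name": "ForEachSource",
      "type": "ForEach",
      "dependsOn": [ { "activity": "LookupConfig", "dependencyConditions": [ "Succeeded" ] } ],
      "typeProperties": {
        "items": { "value": "@activity('LookupConfig').output.value", "type": "Expression" },
        "activities": [
          {
            "name": "SwitchOnSource",
            "type": "Switch",
            "description": "Each case holds the Copy (and other) activities that land one source in the RAW layer",
            "typeProperties": {
              "on": { "value": "@item().SourceName", "type": "Expression" },
              "cases": [
                { "value": "SourceA", "activities": [] },
                { "value": "SourceB", "activities": [] }
              ],
              "defaultActivities": []
            }
          }
        ]
      }
    }
  ]
}
```

The second ForEach for the Processed layer would follow the same pattern, with its inner activities pointed at the processed container and carrying the processing logic.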

You can't rename multiple files at once, so you have to copy the files one after the other.

  • Create a pipeline with a tumbling window trigger - create two parameters named WindowStartTime and WindowEndTime in both the trigger and the pipeline
  • Create a GetMetaData activity, use its last modified datetime filter, and pass WindowStartTime and WindowEndTime to get the list of files that were placed between WindowStartTime and WindowEndTime
  • Create a ForEach activity and pass it the data received from GetMetadata
  • Create a Copy activity inside the ForEach activity and pass the file name from the ForEach loop
  • In the sink dataset, pass the file name and concatenate "_processed.txt" to it
  • After the ForEach activity, create a Copy activity with the processed layer as the source, again passing WindowStartTime and WindowEndTime
  • This Copy activity will read the latest files received on the current day and append them to the SQL DB (see the sketch after this list)
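The fragments below group the key pieces of this answer into one object purely for illustration: the tumbling window trigger handing WindowStartTime / WindowEndTime to the pipeline, a Get Metadata activity filtered by last-modified time, and the sink file name expression. The trigger, pipeline, dataset, and activity names are assumptions, and the exact storeSettings property names for the last-modified filter should be verified against the Get Metadata documentation.

```json
{
  "tumblingWindowTrigger": {
    "name": "DailyWindowTrigger",
    "properties": {
      "type": "TumblingWindowTrigger",
      "typeProperties": { "frequency": "Hour", "interval": 24, "startTime": "2022-03-01T00:00:00Z", "maxConcurrency": 1 },
      "pipeline": {
        "pipelineReference": { "referenceName": "IncrementalLoadPipeline", "type": "PipelineReference" },
        "parameters": {
          "WindowStartTime": "@trigger().outputs.windowStartTime",
          "WindowEndTime": "@trigger().outputs.windowEndTime"
        }
      }
    }
  },
  "getMetadataActivity": {
    "name": "GetNewRawFiles",
    "type": "GetMetadata",
    "description": "List raw files whose last-modified time falls inside the trigger window",
    "typeProperties": {
      "dataset": { "referenceName": "RawFolderDataset", "type": "DatasetReference" },
      "fieldList": [ "childItems" ],
      "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "modifiedDatetimeStart": { "value": "@pipeline().parameters.WindowStartTime", "type": "Expression" },
        "modifiedDatetimeEnd": { "value": "@pipeline().parameters.WindowEndTime", "type": "Expression" }
      }
    }
  },
  "processedSinkFileName": {
    "value": "@concat(replace(item().name, '.txt', ''), '_processed.txt')",
    "type": "Expression"
  }
}
```

Inside the ForEach, item().name would come from the childItems array returned by Get Metadata, so the file name expression turns a raw file like 2022-03-01.txt into 2022-03-01_processed.txt in the processed layer.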


 