Azure Data Lake incremental load with file partition

I'm designing Data Factory pipelines to load data from Azure SQL DB to Azure Data Lake.

My initial load/POC was a small subset of data, and I was able to load it from SQL tables to Azure DL.

Now, there is a huge volume of tables (some with a billion+ rows) that I want to load from SQL DB to Azure DL using DF. The MS docs mention two options, i.e. watermark columns and change tracking. Let's say I have a "cust_transaction" table that has millions of rows; if I load it to DL, it loads as "cust_transaction.txt". Questions:

1) What would be an optimal design to incrementally load the source data from SQL DB into that file in the data lake?

2) How do I split or partition the files into smaller files?

3) How should I merge and load the deltas from the source data into the files? Thanks.

You will want multiple files. Typically, my data lakes have multiple zones. The first zone is Raw. It contains a copy of the source data organized into entity/year/month/day folders, where entity is a table in your SQL DB. Typically, those files are incremental loads. Each incremental load for an entity has a file name similar to Entity_YYYYMMDDHHMMSS.txt (and maybe even more info than that) rather than just Entity.txt. The timestamp in the file name is the end of the incremental slice (the max possible insert or update time in the data) rather than just the current time, wherever possible (sometimes they are effectively the same and it doesn't matter, but I tend to get a consistent incremental slice end time for all tables in my batch). You can achieve the date folders and the timestamp in the file name by parameterizing the folder and file in the dataset.
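As a minimal sketch of that parameterization (assuming ADLS Gen2 and a delimited-text dataset; the dataset, linked service, and parameter names here are placeholders I made up, not anything from the question), the folder path and file name are expressions built from two dataset parameters, so one dataset can serve every table in the batch:

```json
{
    "name": "RawEntityFile",
    "properties": {
        "description": "Sketch only: entityName and sliceEnd are supplied by the copy activity per table/slice.",
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureDataLakeStorageLS",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "entityName": { "type": "string" },
            "sliceEnd":   { "type": "string" }
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "raw",
                "folderPath": {
                    "value": "@concat(dataset().entityName, '/', formatDateTime(dataset().sliceEnd, 'yyyy/MM/dd'))",
                    "type": "Expression"
                },
                "fileName": {
                    "value": "@concat(dataset().entityName, '_', formatDateTime(dataset().sliceEnd, 'yyyyMMddHHmmss'), '.txt')",
                    "type": "Expression"
                }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```

A copy activity passes entityName and the batch's slice end time when it references this dataset, which is how you get the consistent Entity_YYYYMMDDHHMMSS.txt names across all tables in one trigger run.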

Melissa Coates has two good articles on Azure Data Lake: Zones in a Data Lake and Data Lake Use Cases and Planning. Her naming conventions are a bit different from mine, but both of us would tell you to just be consistent. I would land the incremental load file in Raw first. It should reflect the incremental data as it was loaded from the source. If you need a merged version, that can be done with Data Factory or U-SQL (or your tool of choice) and landed in the Standardized Raw zone. There are some performance issues with small files in a data lake, so consolidation could be good, but it all depends on what you plan to do with the data after you land it there. Most users would not access data in the Raw zone, instead using data from the Standardized Raw or Curated Zones. Also, I want Raw to be an immutable archive from which I could regenerate data in other zones, so I tend to leave it in the files as it landed. But if you found you needed to consolidate there, that would be fine.

Change tracking is a reliable way to get changes, but I don't like the naming conventions/file organization in their example. I would make sure your file name has the entity name and a timestamp on it. Their example uses Incremental-[PipelineRunID]. I would prefer [Entity]_[YYYYMMDDHHMMSS]_[TriggerID].txt (or leave the trigger ID off) because it is more informative to others. I also tend to use the trigger ID rather than the pipeline run ID. The trigger ID spans all the pipelines executed in that trigger instance (batch), whereas the pipeline run ID is specific to that pipeline.
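For illustration, a change-tracking copy activity could look roughly like the sketch below (the source dataset name, the key column cust_transaction_id, and the pipeline parameters are hypothetical; this is not the exact shape from the MS tutorial). The source query joins CHANGETABLE to the base table to pull only rows changed since the last synced version:

```json
{
    "name": "CopyChangedRows",
    "type": "Copy",
    "description": "Sketch only: dataset names, key column, and parameters are assumptions.",
    "inputs": [
        { "referenceName": "AzureSqlCustTransaction", "type": "DatasetReference" }
    ],
    "outputs": [
        {
            "referenceName": "RawEntityFile",
            "type": "DatasetReference",
            "parameters": {
                "entityName": "cust_transaction",
                "sliceEnd": "@pipeline().parameters.sliceEnd"
            }
        }
    ],
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": {
                "value": "SELECT ct.SYS_CHANGE_OPERATION, t.* FROM CHANGETABLE(CHANGES dbo.cust_transaction, @{pipeline().parameters.lastSyncVersion}) AS ct LEFT JOIN dbo.cust_transaction AS t ON t.cust_transaction_id = ct.cust_transaction_id",
                "type": "Expression"
            }
        },
        "sink": { "type": "DelimitedTextSink" }
    }
}
```

Note that deleted rows (SYS_CHANGE_OPERATION = 'D') come back with NULL table columns because the row no longer exists, which is why carrying the operation column into the lake file matters if you later merge deltas. You would persist the value of CHANGE_TRACKING_CURRENT_VERSION() captured at the start of each run as the next lastSyncVersion.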

If you can't do change tracking, the watermark is fine. I usually can't add change tracking to my sources and have to go with a watermark. The issue is that you are trusting that the application's modified date is accurate. Are there ever times when a row is updated but the modified date is not changed? When a row is inserted, is the modified date also populated, or would you have to check two columns to get all new and changed rows? These are the things we have to consider when we can't use change tracking.
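If you do end up on the watermark pattern, the two-column concern above translates into a source query along these lines (a sketch: the modified_date/created_date columns and the watermark parameter names are hypothetical):

```json
{
    "name": "CopyNewAndChangedRows",
    "type": "Copy",
    "description": "Sketch only: column and parameter names are assumptions.",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": {
                "value": "SELECT * FROM dbo.cust_transaction WHERE (modified_date > '@{pipeline().parameters.oldWatermark}' AND modified_date <= '@{pipeline().parameters.newWatermark}') OR (created_date > '@{pipeline().parameters.oldWatermark}' AND created_date <= '@{pipeline().parameters.newWatermark}')",
                "type": "Expression"
            }
        },
        "sink": { "type": "DelimitedTextSink" }
    }
}
```

Typically oldWatermark comes from a Lookup activity against a small control table, and newWatermark becomes the slice end time baked into the file name; the OR on created_date covers sources where inserts don't populate the modified date.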

To summarize:

  • Load incrementally and name your incremental files intelligently.
  • If you need a current version of the table in the data lake, that is a separate file in your Standardized Raw or Curated Zone.
