
Incremental data transfer using Azure Data Factory

In an on-premises SQL Server database, I have a number of tables into which various sales data for a chain of stores is inserted during the day. I would like to "harvest" these data to Azure every, say, 15 minutes via Data Factory and an on-premises data management gateway. Clearly, I am not interested in copying all table data every 15 minutes, but only in copying the rows that have been inserted since the last fetch. As far as I can see, the documentation suggests using data "slices" for this purpose. However, these slices appear to require a timestamp (e.g. a datetime) column to exist on the tables that the data is fetched from.

  1. Can I perform a "delta" fetch (i.e. only fetch the rows inserted since the last fetch) without having such a timestamp column? Could I use a sequential integer column instead? Or even have no incrementally increasing column at all?
  2. Assume that the last slice fetched had a window from 08:15 to 08:30. Now, if the clock on the database server is a bit behind the Azure clock, it might add some rows with the timestamp set to 08:29 after that slice was fetched, and these rows will not be included when the next slice (08:30 to 08:45) is fetched. Is there a smart way to avoid this problem? Shifting the slice window a few minutes into the past could minimize the risk, but not totally eliminate it.

Take Azure Data Factory out of the equation. How do you arrange for transfer of deltas to a target system? I think you have a few options:

  1. Add created/changed date columns to the source tables. Write parameterised queries to pick up only new or modified values. ADF supports this scenario with time slices and system variables. Re the identity column, you could do that with a stored procedure (as per here) and a table tracking the last ID sent.
  2. Enable Change Data Capture (CDC) on the source system. This will allow you to access deltas via the CDC functions. Wrap them in a proc and call it with the system variables, similar to the above example.
  3. Always transfer all data, e.g. to staging tables on the target. Use delta code (EXCEPT and MERGE) to work out which records have changed; obviously not ideal for large volumes, but this would work for small volumes.
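To make option 3 a bit more concrete, here is a minimal sketch of the staging-table comparison. It uses Python's built-in sqlite3 as a stand-in for the target database, so it is not SQL Server or ADF code. The table names `staging_sales` and `target_sales`, their columns, and the helper `apply_delta` are all made up for illustration. SQLite has no MERGE statement, so a plain INSERT of the EXCEPT result is used instead, and the sketch assumes every row is unique, because EXCEPT collapses identical rows.

```python
import sqlite3

# Hypothetical schema: target_sales is the copy already on the target,
# staging_sales is the full snapshot just transferred from the source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target_sales  (store_id INTEGER, item TEXT, amount REAL);
    CREATE TABLE staging_sales (store_id INTEGER, item TEXT, amount REAL);
    INSERT INTO target_sales  VALUES (1, 'apples', 3.50), (2, 'pears', 2.00);
    INSERT INTO staging_sales VALUES (1, 'apples', 3.50), (2, 'pears', 2.00),
                                     (2, 'plums', 4.25);
""")


def apply_delta(conn):
    """Insert rows that are in staging but not yet in the target; return them."""
    new_rows = conn.execute("""
        SELECT store_id, item, amount FROM staging_sales
        EXCEPT
        SELECT store_id, item, amount FROM target_sales
    """).fetchall()
    conn.executemany("INSERT INTO target_sales VALUES (?, ?, ?)", new_rows)
    conn.commit()
    return new_rows


first = apply_delta(conn)   # picks up the one row missing from the target
second = apply_delta(conn)  # nothing left to copy on the second pass
print(first, second)
```

Because the comparison is recomputed from the full snapshot each time, running it twice is harmless, which is the main attraction of this option for small tables.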

HTH

We are planning to add this capability to ADF. It may start with a sequential integer column instead of a timestamp. Could you please let me know whether a sequential integer column would help?
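For reference, a delta fetch driven by a sequential integer column can be sketched as below. This is only an illustration of the "table tracking the last ID sent" idea, written against Python's built-in sqlite3 rather than SQL Server or ADF. The `sales` and `watermark` tables and the `fetch_delta` helper are hypothetical names. It only detects inserted rows, and it assumes IDs become visible in increasing order; with concurrent writers a lower ID can commit after a higher one has already been read, so a real pipeline would need to account for that.

```python
import sqlite3

# Hypothetical source table with a sequential integer key, plus a one-row
# watermark table that remembers the last ID already transferred.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT);
    CREATE TABLE watermark (table_name TEXT PRIMARY KEY, last_id INTEGER NOT NULL);
    INSERT INTO watermark VALUES ('sales', 0);
""")


def fetch_delta(conn):
    """Return rows inserted since the previous call and advance the watermark."""
    (last_id,) = conn.execute(
        "SELECT last_id FROM watermark WHERE table_name = 'sales'"
    ).fetchone()
    rows = conn.execute(
        "SELECT id, item FROM sales WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    if rows:
        # Only move the watermark once the new rows have actually been read.
        conn.execute(
            "UPDATE watermark SET last_id = ? WHERE table_name = 'sales'",
            (rows[-1][0],),
        )
        conn.commit()
    return rows


conn.executemany("INSERT INTO sales (item) VALUES (?)", [("apples",), ("pears",)])
batch1 = fetch_delta(conn)  # the two rows inserted so far
conn.execute("INSERT INTO sales (item) VALUES ('plums')")
batch2 = fetch_delta(conn)  # only the row added since batch1
batch3 = fetch_delta(conn)  # empty: nothing new
print(batch1, batch2, batch3)
```

Since the watermark is an ID taken from the source table itself, this approach does not depend on the source and Azure clocks agreeing.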

By enabling "Change Tracking" on SQL Server, you can use "SYS_CHANGE_VERSION" to incrementally load data from an on-premises SQL Server or Azure SQL Database via Azure Data Factory.

https://docs.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-change-tracking-feature-portal

If using SQL Server 2016, see https://msdn.microsoft.com/en-us/library/mt631669.aspx#Enabling-system-versioning-on-a-new-table-for-data-audit . Otherwise, you can implement the same using triggers.

And use NTP to synchronize your server time.
