How to fully automate CDC in SQL Server?

Question

Is there a way to fully automate SQL Server CDC initialization in an active SQL Server database? I am trying to work out which from_lsn to use for the first CDC data capture.

Sequence of events:

Enable CDC on the given database/table
Copy the full table to the destination (data lake)
Use CDC to capture the first delta (I want to avoid duplicates without missing a transaction)

Problem:

How do I get the from_lsn for the fn_cdc_get_all_changes_Schema_Table(from_lsn, to_lsn, '<row_filter_option>') function?

Notes:

The process must be 100% automated
Transactions on the table cannot be stopped
No data may be missed, and duplicate data is not acceptable

Answer 1

Before doing the initial load, get the value of fn_cdc_get_max_lsn() and store it. This function returns the highest LSN known to CDC across all capture instances. It's the high-water mark for the whole database.
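A minimal T-SQL sketch of this first step. The bookkeeping table dbo.cdc_watermark and the capture instance name dbo_t are assumptions made for illustration; neither comes from the question.

```sql
-- Hypothetical bookkeeping table: one row per capture instance.
CREATE TABLE dbo.cdc_watermark (
    capture_instance sysname    NOT NULL PRIMARY KEY,
    last_lsn         binary(10) NOT NULL
);

-- Run this BEFORE the full copy starts.
DECLARE @high_water binary(10) = sys.fn_cdc_get_max_lsn();

-- The function may return NULL if CDC has not recorded any LSN yet,
-- so stop here rather than store an unusable watermark.
IF @high_water IS NULL
BEGIN
    RAISERROR(N'CDC has not recorded any LSN yet; retry later.', 16, 1);
    RETURN;
END;

INSERT INTO dbo.cdc_watermark (capture_instance, last_lsn)
VALUES (N'dbo_t', @high_water);
```

Persisting the value in a table rather than a variable lets a scheduled job pick it up later without manual input.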

Copy the whole table.

Start your delta process. The first time you call the delta function, the value of the min_lsn argument (from_lsn in the function signature) will be the stored value previously retrieved from fn_cdc_get_max_lsn(). Get the current value from fn_cdc_get_max_lsn() (not the stored one) and use it as the value of the max_lsn argument (to_lsn).
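The first delta pull might look roughly like this. It assumes a capture instance named dbo_t and a hypothetical table dbo.cdc_watermark that holds the LSN stored before the initial load; both names are illustrative only.

```sql
DECLARE @from_lsn binary(10), @to_lsn binary(10);

-- Lower bound: the high-water mark stored BEFORE the full copy.
SELECT @from_lsn = last_lsn
FROM dbo.cdc_watermark               -- hypothetical bookkeeping table
WHERE capture_instance = N'dbo_t';

-- Upper bound: the CURRENT high-water mark, not the stored one.
SET @to_lsn = sys.fn_cdc_get_max_lsn();

-- Note: the function raises an error when @from_lsn is lower than
-- sys.fn_cdc_get_min_lsn(N'dbo_t'), so an automated job may need to
-- check for that case before calling it.
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_t(@from_lsn, @to_lsn, N'all')
ORDER BY __$start_lsn, __$seqval;
```

Both bounds are inclusive, which is why the stored value is used unchanged here: a change recorded at exactly that LSN is still returned.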

From here, proceed as you would expect. Take the maximum LSN returned from the delta function and store it. The next time you pull a delta, call fn_cdc_increment_lsn on the stored value and use the result as the min_lsn (from_lsn) argument, and use the result of fn_cdc_get_max_lsn() as the max_lsn (to_lsn) argument.
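A sketch of one subsequent delta run under the same assumptions: a capture instance named dbo_t and a hypothetical bookkeeping table dbo.cdc_watermark whose last_lsn column holds the highest LSN already processed.

```sql
DECLARE @last_lsn binary(10), @from_lsn binary(10),
        @to_lsn   binary(10), @new_lsn  binary(10);

SELECT @last_lsn = last_lsn
FROM dbo.cdc_watermark               -- hypothetical bookkeeping table
WHERE capture_instance = N'dbo_t';

-- Start one step past the last LSN already processed.
SET @from_lsn = sys.fn_cdc_increment_lsn(@last_lsn);
SET @to_lsn   = sys.fn_cdc_get_max_lsn();

-- Skip the run when nothing new has been captured; calling the
-- function with from_lsn > to_lsn would raise an error.
IF @from_lsn <= @to_lsn
BEGIN
    SELECT *
    INTO #delta
    FROM cdc.fn_cdc_get_all_changes_dbo_t(@from_lsn, @to_lsn, N'all');

    -- ... ship the rows in #delta to the destination here ...

    -- Store the highest LSN actually returned; keep the old value
    -- when the delta turned out to be empty.
    SELECT @new_lsn = MAX(__$start_lsn) FROM #delta;

    UPDATE dbo.cdc_watermark
    SET last_lsn = @new_lsn
    WHERE capture_instance = N'dbo_t'
      AND @new_lsn IS NOT NULL;
END;
```

Updating the watermark only after the rows have been shipped means a failed run is simply repeated from the same starting point.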

With this process you will never miss a change, because each delta window starts at, or immediately after, the last LSN you have already processed.

Now, you mentioned that you want to avoid "duplicates". But if you try to define what a "duplicate" is in this scenario, I think you'll find it difficult.

For example, suppose I have this table to begin with:

create table t(i int primary key, c char);
insert t(i, c) values (1, 'a');

I call fn_cdc_get_max_lsn() and get 0x01.
A user inserts a new row into the table: insert t(i, c) values (2, 'b');
The user operation is associated with an LSN value of 0x02.
I select all the rows in this table (getting two rows).
I write both rows to my destination table.
I start my delta process. My min_lsn argument will be 0x01.

I will therefore get the {2, 'b'} row in the delta.

But I already retrieved the row {2, 'b'} as part of my initial load. Is this a "duplicate"? No, this represents a change to the table. What will I do with this delta when I load it into my destination? There are really only two options.

Option 1: I am going to merge the delta into the destination table based on the primary key. In that case, when I merge the delta I will overwrite the already-loaded row {2, 'b'} with the new row {2, 'b'}, the outcome of which looks the same as not doing anything.
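To illustrate why the merge is harmless, here is a small self-contained T-SQL sketch. The temp tables #dest and #delta and the op column are hypothetical stand-ins for the destination and a staged delta; the op codes follow the CDC __$operation convention (1 = delete, 2 = insert, 4 = update after image). MERGE expects at most one source row per key, so a real delta would first be reduced to the latest change per key.

```sql
-- Destination after the initial load, which already saw row 2.
CREATE TABLE #dest (i int PRIMARY KEY, c char(1));
INSERT #dest (i, c) VALUES (1, 'a'), (2, 'b');

-- Staged first delta: the CDC insert for that same row.
CREATE TABLE #delta (op int, i int, c char(1));
INSERT #delta (op, i, c) VALUES (2, 2, 'b');

MERGE #dest AS d
USING #delta AS s
    ON d.i = s.i
WHEN MATCHED AND s.op = 1 THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET d.c = s.c
WHEN NOT MATCHED BY TARGET AND s.op <> 1 THEN
    INSERT (i, c) VALUES (s.i, s.c);

-- #dest still holds exactly two rows: (1, 'a') and (2, 'b').
SELECT i, c FROM #dest ORDER BY i;
```

Because the merge is keyed on the primary key, replaying a change that the initial load already reflected leaves the destination unchanged.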

Option 2: I am going to append all changes to the destination. In that case my destination table will contain the row {2, 'b'} twice. Is this a duplicate? No, because the two rows represent how the data looked at different logical times: first when I did the initial load, and then when I did the delta.

If you try to argue that this is in fact a duplicate, then I counter by giving you this hypothetical scenario:

You do the initial load, receiving row {1, 'a'}.
No users change any data.
You get your first delta, which is empty.
A user executes update t set c = 'b' where i = 1.
You get your second delta, which will include the row {1, 'b'}.
A user executes update t set c = 'a' where i = 1.
You get your third delta, which will include the row {1, 'a'}.

Question: Is the row you retrieved during your third delta a "duplicate"? It has the same values as a row we already retrieved previously.

If your answer is "yes", then you can never eliminate "duplicate" reads, because a "duplicate" will occur any time a row mutates to have the same values it had at some previous point in time, which is something over which you have no control. If this is a "duplicate" that you need to eliminate in the append scenario, then that elimination must be performed at the destination, by comparing the incoming values with the existing values.
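If that strict definition of "duplicate" is what you want, destination-side elimination could be sketched as follows. The temp tables #history and #incoming are hypothetical stand-ins for the append-only destination and an incoming delta, populated with the rows from the scenario above.

```sql
-- Append-only destination after the initial load and the second delta.
CREATE TABLE #history (i int, c char(1));
INSERT #history (i, c) VALUES (1, 'a'), (1, 'b');

-- Third delta: the row has mutated back to its original values.
CREATE TABLE #incoming (i int, c char(1));
INSERT #incoming (i, c) VALUES (1, 'a');

-- Append only rows whose values have never been stored before.
INSERT #history (i, c)
SELECT i, c FROM #incoming
EXCEPT
SELECT i, c FROM #history;

-- #history still holds two rows; the incoming {1, 'a'} was dropped.
SELECT i, c FROM #history ORDER BY i, c;
```

Note the cost of this choice: the destination no longer records that the row returned to 'a', which is the reason the answer argues such rows are not really duplicates.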

Answer 1
0 2022-07-20 22:46:43
