
How to properly truncate a staging table in an ETL pipeline?

We have an ETL pipeline that runs for each CSV uploaded into a storage account (Azure). It runs some transformations on the CSV, writes the output to another location (also as CSV), and calls a stored procedure on the database (SQL Azure) that ingests (BULK INSERT) the resulting CSV into a staging table.
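For illustration, the ingestion step looks something like this (the table name, external data source, and file path below are placeholders; on Azure SQL, BULK INSERT reads from Blob Storage via an external data source):

BULK INSERT dbo.StagingTable
FROM 'output/transformed.csv'
WITH (DATA_SOURCE = 'TransformedCsvBlob', FORMAT = 'CSV', FIRSTROW = 2);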

This pipeline can have concurrent executions, since multiple resources can be uploading files to the storage. Hence, data is inserted into the staging table quite often.

Then we have a scheduled SQL job (Elastic Job) that triggers an SP that moves the data from the staging table into the final table. At this point, we want to truncate/empty the staging table so that the rows are not re-inserted in the next execution of the job.

The problem is, we cannot be sure that between the load from the staging table into the final table and the TRUNCATE command, no new data has been written into the staging table; such data would be truncated without first being inserted into the final table.

Is there a way to lock the staging table while we're copying the data into the final table, so that the SP (called from the ETL pipeline) trying to write to it will just wait until the lock is released? Is this achievable by using transactions, or maybe some manual lock commands?

If not, what's the best approach to handle this?

I like sp_getapplock and use this method myself in a few places, for its flexibility and because you have full control over the locking logic and wait times.

The only problem that I see is that, in your case, the concurrent processes are not all equal.

You have SP1, which moves data from the staging table into the main table. Your system never tries to run several instances of this SP.

Another procedure, SP2, which inserts data into the staging table, can be run several times simultaneously, and that is fine.

It is easy to implement locking that prevents any concurrent run of any combination of SP1 and SP2. In other words, it is easy if the locking logic is the same for SP1 and SP2 and they are treated as equals. But then you can't have several instances of SP2 running simultaneously.

It is not obvious how to implement locking that prevents concurrent runs of SP1 and SP2 while still allowing several instances of SP2 to run simultaneously.


There is another approach that doesn't attempt to prevent concurrent runs of the SPs at all, but instead embraces and expects that simultaneous runs are possible.

One way to do it is to add an IDENTITY column to the staging table. Or an automatically populated datetime, if you can guarantee that it is unique and never decreases, which can be tricky. Or a rowversion column.
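For example, assuming the staging table is called dbo.StagingTable (a placeholder name), the IDENTITY column could be added like this:

-- Every inserted row gets an ever-increasing ID.
ALTER TABLE dbo.StagingTable
    ADD ID bigint IDENTITY(1,1) NOT NULL;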

The logic inside SP2, which inserts data into the staging table, doesn't change.

The logic inside SP1, which moves data from the staging table into the main table, needs to use these identity values.

First, read the current maximum identity value from the staging table and remember it in a variable, say @MaxID. All subsequent SELECTs, UPDATEs and DELETEs against the staging table in that run of SP1 should include the filter WHERE ID <= @MaxID.

This ensures that if a new row happens to be added to the staging table while SP1 is running, that row will not be processed and will remain in the staging table until the next run of SP1.

The drawback of this approach is that you can't use TRUNCATE; you need to use DELETE with WHERE ID <= @MaxID.
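A minimal sketch of SP1 under this scheme (table and column names are placeholders):

-- Move only the rows that existed when this run started.
DECLARE @MaxID bigint = (SELECT MAX(ID) FROM dbo.StagingTable);

IF @MaxID IS NOT NULL
BEGIN
    INSERT INTO dbo.FinalTable (Col1, Col2)
    SELECT Col1, Col2
    FROM dbo.StagingTable
    WHERE ID <= @MaxID;

    -- TRUNCATE would also remove rows inserted after @MaxID was read,
    -- so delete only what was moved.
    DELETE FROM dbo.StagingTable
    WHERE ID <= @MaxID;
END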


If you are OK with several instances of SP2 waiting for each other (and for SP1), then you can use sp_getapplock similar to the following. I have this code in my stored procedure. You should put this logic into both SP1 and SP2.

I'm not calling sp_releaseapplock explicitly here, because the lock owner is set to Transaction and the engine will release the lock automatically when the transaction ends.

You don't have to put the retry logic in the stored procedure; it can live in the external code that runs these stored procedures. In any case, your code should be ready to retry.

CREATE PROCEDURE SP2  -- or SP1
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    BEGIN TRANSACTION;
    BEGIN TRY
        -- Maximum number of retries
        DECLARE @VarCount int = 10;
        DECLARE @VarLockResult int = -1; -- negative means "not acquired"

        WHILE (@VarCount > 0)
        BEGIN
            SET @VarCount = @VarCount - 1;

            EXEC @VarLockResult = sp_getapplock
                @Resource = 'StagingTable_app_lock',
                -- this resource name should be the same in SP1 and SP2
                @LockMode = 'Exclusive',
                @LockOwner = 'Transaction',
                @LockTimeout = 60000,
                -- I'd set this timeout to be about twice the time
                -- you expect SP to run normally
                @DbPrincipal = 'public';

            IF @VarLockResult >= 0
            BEGIN
                -- Acquired the lock

                -- for SP2
                -- INSERT INTO StagingTable ...

                -- for SP1
                -- SELECT FROM StagingTable ...
                -- TRUNCATE TABLE StagingTable ...

                -- don't retry any more
                BREAK;
            END ELSE BEGIN
                -- wait for 5 seconds and retry
                WAITFOR DELAY '00:00:05';
            END;
        END;

        IF @VarLockResult < 0
        BEGIN
            -- The lock was never acquired; raise an error so the
            -- transaction rolls back and the caller knows to retry.
            RAISERROR('Could not acquire the application lock.', 16, 1);
        END;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
        -- log the error, then rethrow so the caller can retry
        THROW;
    END CATCH;

END

This code guarantees that only one procedure is working with the staging table at any given moment. There is no concurrency; all other instances will wait.

Obviously, if you access the staging table through anything other than SP1 or SP2 (which try to acquire the lock first), such access will not be blocked.

I would propose a solution with two identical staging tables. Let's name them StageLoading and StageProcessing.
The load process would have the following steps:
1. At the beginning, both tables are empty.
2. We load some data into the StageLoading table (I assume each load is a transaction).
3. When the Elastic Job starts, it will:
- ALTER TABLE SWITCH to move all data from StageLoading to StageProcessing (see the sketch after this list). This makes StageLoading empty and ready for the next loads. It is a metadata-only operation, so it takes milliseconds, and it is fully blocking, so it will be done between loads.
- Load the data from StageProcessing into the final tables.
- Truncate the StageProcessing table.
4. Now we are ready for the next Elastic Job.

If we try to do the SWITCH when StageProcessing is not empty, the ALTER will fail, and that will mean the last load process failed.
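A minimal sketch of step 3 (assuming both tables have identical schemas on the same filegroup; dbo.FinalTable and the columns are placeholders):

-- Metadata-only move; fails if dbo.StageProcessing is not empty,
-- i.e. if the previous run did not finish.
ALTER TABLE dbo.StageLoading SWITCH TO dbo.StageProcessing;

INSERT INTO dbo.FinalTable (Col1, Col2)
SELECT Col1, Col2
FROM dbo.StageProcessing;

TRUNCATE TABLE dbo.StageProcessing;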

Is there a way to lock the staging table while we're copying the data into the final table, so that the SP (called from the ETL pipeline) trying to write to it will just wait until the lock is released? Is this achievable by using transactions, or maybe some manual lock commands?

It looks like you are searching for a mechanism that is wider than transaction level. SQL Server / Azure SQL DB has one, and it is called an application lock:

sp_getapplock

Places a lock on an application resource.

Locks placed on a resource are associated with either the current transaction or the current session. Locks associated with the current transaction are released when the transaction commits or rolls back. Locks associated with the session are released when the session is logged out. When the server shuts down for any reason, all locks are released.

Locks can be explicitly released with sp_releaseapplock. When an application calls sp_getapplock multiple times for the same lock resource, sp_releaseapplock must be called the same number of times to release the lock. When a lock is opened with the Transaction lock owner, that lock is released when the transaction is committed or rolled back.

It basically means that your ETL tool should open a single session to the DB, acquire the lock, and release it when finished. Other sessions, before trying to do anything, should try to acquire the lock (they cannot, because it is already taken), wait until it is released, and then continue to work.
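A minimal sketch of that session-scoped pattern (the resource name is arbitrary, but must be the same in every session):

-- Acquire a session-owned lock; waits up to 60 seconds if another session holds it.
DECLARE @result int;
EXEC @result = sp_getapplock
    @Resource = 'StagingTable_app_lock',
    @LockMode = 'Exclusive',
    @LockOwner = 'Session',
    @LockTimeout = 60000;

IF @result >= 0
BEGIN
    -- ... work with the staging table ...

    EXEC sp_releaseapplock
        @Resource = 'StagingTable_app_lock',
        @LockOwner = 'Session';
END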

Assuming you have a single outbound job (these steps are sketched in code below):

  • Add an OutboundProcessing BIT DEFAULT 0 column to the table
  • In the job, SET OutboundProcessing = 1 WHERE OutboundProcessing = 0 (claim the rows)
  • For the ETL, incorporate WHERE OutboundProcessing = 1 in the query that sources the data (transfer the rows)
  • After the ETL, DELETE FROM TABLE WHERE OutboundProcessing = 1 (remove the rows you transferred)
  • If the ETL fails, SET OutboundProcessing = 0 WHERE OutboundProcessing = 1
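In T-SQL the steps could look something like this (dbo.StagingTable, dbo.FinalTable and the columns are placeholder names):

-- Claim the rows that exist at the start of this run.
UPDATE dbo.StagingTable
SET OutboundProcessing = 1
WHERE OutboundProcessing = 0;

-- Transfer only the claimed rows.
INSERT INTO dbo.FinalTable (Col1, Col2)
SELECT Col1, Col2
FROM dbo.StagingTable
WHERE OutboundProcessing = 1;

-- Remove the rows that were transferred.
DELETE FROM dbo.StagingTable
WHERE OutboundProcessing = 1;

-- If the ETL fails, un-claim the rows instead:
-- UPDATE dbo.StagingTable SET OutboundProcessing = 0 WHERE OutboundProcessing = 1;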

I always prefer to "ID" each file I receive. If you can do this, you can associate the records from a given file throughout your load process. You haven't called out a need for this, but just sayin'.

However, with each file having an identity (just an int/bigint identity value should do), you can then dynamically create as many load tables as you like from a "template" load table, as sketched after the list below.

  1. When a file arrives, create a new load table named with the ID of the file.
  2. Process your data from the load table to the final table.
  3. Drop the load table for the file being processed.
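A minimal sketch of step 1 (dbo.LoadTemplate and the file ID are placeholders):

DECLARE @FileID bigint = 42; -- placeholder file identity

-- Create an empty per-file load table with the template's columns.
DECLARE @sql nvarchar(max) =
    N'SELECT * INTO dbo.Load_' + CAST(@FileID AS nvarchar(20)) +
    N' FROM dbo.LoadTemplate WHERE 1 = 0;';
EXEC sp_executesql @sql;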

This is somewhat similar to the other solution about using two tables (load and stage), but even in that solution you are still limited to having two files "loaded" (and you're still only applying one file to the final table at a time, though?).

Last, it is not clear whether your "Elastic Job" is detached from the actual "load" pipeline/processing, or included in it. Being a job, I assume it is not included; and if it is a job, can you only run a single instance at a time? So it's not clear why it's important to load multiple files at once if you can only move one from load to final at a time. Why the rush to get files into load?
