
How to replace primary key with surrogate keys during ETL?

Original post: 2021-03-07 19:54:31 | Tags: sql, etl, primary-key, surrogate-key, sql-data-warehouse

I have a question that has been haunting me for some time.

What does replacing primary keys with surrogate keys look like in practice during the ETL process? What is the workflow - is it just assigning a new IDENTITY column? If so, what about the previous values: how do I replace the existing business keys with the newly created ones?

In my mind a specific workflow looks like the one below, but I haven't done it in practice yet:

1. Drop the existing PK_Product and FK_Product constraints on the DimProduct and FactSales tables.
2. Add a new IDENTITY column to DimProduct.
3. Add a new column to FactSales, populated with the values of the newly created IDENTITY column via a join on the previous business key.
4. Drop the old ProductKey columns in both tables.
5. Add constraints for the newly created surrogate IDENTITY keys.
6. Set up the reference between the tables for future values.

But please tell me how you do this in your job and correct me, because I think I'm wrong.
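To make the IDENTITY and backfill steps of the workflow above concrete, here is a rough sketch using Python's built-in `sqlite3` module. Several things are assumptions for illustration only: the column names `ProductSK`, `ProductName` and `Amount`, the helper table `DimProductNew`, and the sample rows. SQLite's `AUTOINCREMENT` stands in for IDENTITY, and because SQLite cannot add a primary-key column to an existing table, the dimension is rebuilt rather than altered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical starting point: both tables are keyed on the business key.
    CREATE TABLE DimProduct (ProductKey TEXT PRIMARY KEY, ProductName TEXT);
    CREATE TABLE FactSales  (ProductKey TEXT, Amount REAL);
    INSERT INTO DimProduct VALUES ('P-100', 'Widget'), ('P-200', 'Gadget');
    INSERT INTO FactSales  VALUES ('P-100', 10.0), ('P-200', 25.0), ('P-100', 5.0);

    -- Rebuild the dimension with a generated surrogate key and keep the
    -- old key as an ordinary, unique business-key column.
    CREATE TABLE DimProductNew (
        ProductSK   INTEGER PRIMARY KEY AUTOINCREMENT,
        ProductKey  TEXT NOT NULL UNIQUE,
        ProductName TEXT
    );
    INSERT INTO DimProductNew (ProductKey, ProductName)
    SELECT ProductKey, ProductName FROM DimProduct;

    -- Add the surrogate column to the fact table and backfill it by
    -- joining on the old business key.
    ALTER TABLE FactSales ADD COLUMN ProductSK INTEGER;
    UPDATE FactSales
    SET ProductSK = (SELECT d.ProductSK
                     FROM DimProductNew AS d
                     WHERE d.ProductKey = FactSales.ProductKey);
""")

# Map each business key to its new surrogate key, then read the fact rows back.
sk_by_bk = dict(conn.execute("SELECT ProductKey, ProductSK FROM DimProductNew"))
fact_rows = conn.execute("SELECT ProductKey, ProductSK FROM FactSales").fetchall()
```

After this runs, every row in `FactSales` carries the surrogate key of its product, and the old `ProductKey` column could be dropped from the fact table once nothing depends on it.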

1 Answer

Let's take the simplest case, where your target dimension is being loaded from a single source system. The basic steps would be:

1. Take the unique identifier for the source system record - normally either the PK or the BK.
2. Use this identifier to look up the corresponding record in the target dimension - which holds this identifier as well as the SK and other attributes - and return the SK if a record is found in the Dim.
3. If an SK is found, perform an update on the Dim using the SK as the primary identifier.
   a. You may also need to perform an insert, e.g. if the Dim is SCD2.
   b. If there have been no changes between the source and target record, you may decide not to process the source record.
4. If no SK is found, insert a new record into the target Dim, generating a new SK value in one of two main ways:
   a. Using the capabilities of the underlying database, such as sequences, auto-increment columns, etc.
   b. Using the capabilities of your ETL tool, e.g. a sequence generator.
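As a rough illustration of the lookup steps above, here is a minimal row-by-row sketch using Python's built-in `sqlite3` module. The table `dim_product`, its columns, the function `load_dim_row` and the sample records are all hypothetical. It assumes an SCD1-style overwrite and lets the database generate the SK through an auto-increment column; an SCD2 dimension would insert a new row version instead of updating.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_product (
        product_sk   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key from the DB
        product_bk   TEXT NOT NULL UNIQUE,               -- source-system identifier
        product_name TEXT
    )
""")

def load_dim_row(conn, product_bk, product_name):
    """Process one source record; return (surrogate key, action taken)."""
    # Look up the SK in the dimension by the source identifier.
    found = conn.execute(
        "SELECT product_sk, product_name FROM dim_product WHERE product_bk = ?",
        (product_bk,),
    ).fetchone()

    if found is None:
        # No SK found: insert a new record and let the database assign the SK.
        cur = conn.execute(
            "INSERT INTO dim_product (product_bk, product_name) VALUES (?, ?)",
            (product_bk, product_name),
        )
        return cur.lastrowid, "inserted"

    sk, current_name = found
    if current_name == product_name:
        # Source and target agree: skip the record.
        return sk, "unchanged"

    # SK found and attributes differ: update, addressing the row by its SK.
    conn.execute(
        "UPDATE dim_product SET product_name = ? WHERE product_sk = ?",
        (product_name, sk),
    )
    return sk, "updated"

source_records = [
    ("P-100", "Widget"),
    ("P-200", "Gadget"),
    ("P-100", "Widget"),     # same record again, no change
    ("P-100", "Widget v2"),  # same business key, changed attribute
]
results = [load_dim_row(conn, bk, name) for bk, name in source_records]
```

The surrogate key returned for each record is what the fact load would store in place of the source identifier.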

These are obviously the logical steps you need to follow. How you actually implement them depends entirely on your ETL/ELT components - so running a merge command in your DB will look very different from an Informatica workflow, but "under the covers" both processes follow the same logical steps.
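For comparison, the same logic can be written set-based inside the database. The sketch below again uses Python's `sqlite3` with hypothetical tables `stg_product` (staged source rows) and `dim_product`. SQLite has no MERGE statement, so the "matched" and "not matched" branches are approximated here with an UPDATE followed by an INSERT ... WHERE NOT EXISTS; on a database that supports MERGE, both branches would typically sit in one statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_product (product_bk TEXT PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_product (
        product_sk   INTEGER PRIMARY KEY AUTOINCREMENT,
        product_bk   TEXT NOT NULL UNIQUE,
        product_name TEXT
    );
    -- Hypothetical state: one product already loaded, two rows staged.
    INSERT INTO dim_product (product_bk, product_name) VALUES ('P-100', 'Widget');
    INSERT INTO stg_product VALUES ('P-100', 'Widget v2'), ('P-200', 'Gadget');
""")

with conn:  # run both statements in one transaction
    # "Matched" branch: overwrite attributes that differ from the staged row.
    conn.execute("""
        UPDATE dim_product
        SET product_name = (SELECT s.product_name
                            FROM stg_product AS s
                            WHERE s.product_bk = dim_product.product_bk)
        WHERE EXISTS (SELECT 1
                      FROM stg_product AS s
                      WHERE s.product_bk = dim_product.product_bk
                        AND s.product_name IS NOT dim_product.product_name)
    """)
    # "Not matched" branch: insert unseen business keys; the DB assigns the SK.
    conn.execute("""
        INSERT INTO dim_product (product_bk, product_name)
        SELECT s.product_bk, s.product_name
        FROM stg_product AS s
        WHERE NOT EXISTS (SELECT 1
                          FROM dim_product AS d
                          WHERE d.product_bk = s.product_bk)
    """)

dim_rows = conn.execute(
    "SELECT product_sk, product_bk, product_name FROM dim_product ORDER BY product_sk"
).fetchall()
```

The existing product keeps its surrogate key while its name is overwritten, and the new product receives a freshly generated key.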