
Using Surrogate Keys in a Data Warehouse: Pros and Cons

Original post 2020-10-29 08:55:03 4 1 sql/ performance/ etl/ data-warehouse/ surrogate-key

The surrogate key is a mechanism that has been in our textbooks for years, and I hate to bring it up for discussion again. Everyone talks about the benefits of using a surrogate key instead of a business key. Even Microsoft Analysis Services Tabular and Microsoft Power BI tabular models work with surrogate keys. Both platforms let you connect a dimension and a fact using a single column, which in practice means a surrogate key, since it is very difficult to have one single-column business key in real life.

Working as a BI architect in recent years, I have used both Analysis Services Multidimensional and Tabular. I had Multidimensional projects that handled up to 500GB in the data warehouse each night. I faced facts constructed from 5-6 unions and 8-10 joins among tables with millions of records.

Here comes the question: when using a surrogate key, the fact can only learn a dimension's key through an extra join. As a result, if we want to relate N dimensions (which are not already joined to the fact in its construction expression) to a single fact, we need N additional joins in the data warehouse.
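To make the extra joins concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`stg_sales`, `dim_customer`, `dim_product`, `fact_sales`) and the sample rows are assumptions made up for illustration, not taken from any real warehouse. It shows a staged fact that still carries business keys being loaded with one lookup join per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical dimensions: a surrogate key (sk) next to the business key (bk).
    CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY, customer_bk TEXT);
    CREATE TABLE dim_product  (product_sk  INTEGER PRIMARY KEY, product_bk  TEXT);
    -- Staged fact rows still carry the business keys from the source system.
    CREATE TABLE stg_sales  (customer_bk TEXT, product_bk TEXT, amount REAL);
    CREATE TABLE fact_sales (customer_sk INTEGER, product_sk INTEGER, amount REAL);

    INSERT INTO dim_customer VALUES (1, 'C-100'), (2, 'C-200');
    INSERT INTO dim_product  VALUES (10, 'P-A'), (20, 'P-B');
    INSERT INTO stg_sales    VALUES ('C-100', 'P-B', 15.0), ('C-200', 'P-A', 7.5);
""")

# One lookup join per related dimension: N dimensions means N extra joins.
conn.execute("""
    INSERT INTO fact_sales (customer_sk, product_sk, amount)
    SELECT c.customer_sk, p.product_sk, s.amount
    FROM stg_sales AS s
    JOIN dim_customer AS c ON c.customer_bk = s.customer_bk
    JOIN dim_product  AS p ON p.product_bk  = s.product_bk
""")

fact_rows = conn.execute(
    "SELECT customer_sk, product_sk, amount FROM fact_sales ORDER BY customer_sk"
).fetchall()
print(fact_rows)
```

With two dimensions there are two lookup joins. Relating the fact to 10-15 dimensions would add 10-15 such joins on top of whatever the fact expression already does.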

Take the previous example: for that particular fact we need 5-6 unions + (8-10 + N) joins, which increases the complexity. Imagine what happens once we are required to relate this fact to 10-15 dimensions to get their surrogate keys.

All these years I have been reading my fact expressions over my morning coffee, like reading a newspaper, removing unused columns, unions, and joins, and doing everything I can to reduce complexity and save ETL processing time.

I fully understand that we will save time when querying the data warehouse and the semantic layer, but what about ETL? Am I missing something?

1 Answer

A couple of thoughts about your question...

If you didn't use SKs, how would you handle SCD2 dimensions, where the natural/business keys from the source system (even if they were a single column) wouldn't be unique?
The purpose of a DW is to make it easier and quicker to query your data. If you consider that any problem takes a certain amount of effort to resolve, then you have a choice of where to apply that effort in the chain of activities required to produce the solution. If you want to reduce the effort of querying, then you need to increase the effort in data preparation, i.e. your ETL.
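To illustrate the SCD2 point, here is a minimal sketch using Python's built-in `sqlite3` module. The `valid_from`/`valid_to` columns, the table names, and the sample rows are assumptions made up for illustration, and a real SCD2 implementation may track versions differently. It shows that the same business key appears on more than one dimension row, so only the surrogate key can pin a fact row to the version that was current on the fact's date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical SCD2 dimension: each change to a customer adds a new row.
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY,  -- surrogate key, unique per version
        customer_bk TEXT,                 -- business key, repeated across versions
        city        TEXT,
        valid_from  TEXT,                 -- ISO dates, so string comparison works
        valid_to    TEXT
    );
    INSERT INTO dim_customer VALUES
        (1, 'C-100', 'Athens', '2019-01-01', '2020-06-30'),
        (2, 'C-100', 'Berlin', '2020-07-01', '9999-12-31');

    CREATE TABLE stg_sales (customer_bk TEXT, sale_date TEXT, amount REAL);
    INSERT INTO stg_sales VALUES
        ('C-100', '2020-03-15', 10.0),
        ('C-100', '2020-09-01', 20.0);
""")

# The business key alone matches two dimension rows, so it is not unique.
bk_count = conn.execute(
    "SELECT COUNT(*) FROM dim_customer WHERE customer_bk = 'C-100'"
).fetchone()[0]

# The ETL lookup uses business key plus date range to pick one surrogate key.
resolved = conn.execute("""
    SELECT s.sale_date, d.customer_sk, d.city
    FROM stg_sales AS s
    JOIN dim_customer AS d
      ON d.customer_bk = s.customer_bk
     AND s.sale_date BETWEEN d.valid_from AND d.valid_to
    ORDER BY s.sale_date
""").fetchall()
print(bk_count, resolved)
```

The date-range lookup is paid once in ETL. Queries against the fact can then join on `customer_sk` alone, which is the trade of preparation effort for query effort described above.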