
Are Surrogate Primary Keys needed on a Fact table in a Data Warehouse?

Original post: 2009-05-30 18:10:51 1 12 database-design / business-intelligence

When I asked our DB designers why our fact table does not have a PK, I was told that there is no set of columns in the table that would uniquely identify a record, even if all the columns were selected. When I suggested that we add an identity column in that case, I was told that "I'd just be wasting space and that it wasn't needed."

My feeling is that every table in the source system should have a PK, even if it is an identity column. Given that the data warehouse (DW) is a recipient of data from other systems, how would I otherwise be able to ensure that the data in the DW accurately reflects what is in the source system if there is no way to tie individual records together? If you have a runaway load program that screws up data and has run for a week, how would you reconcile the differences with a live transaction source system without some sort of unique constraint to compare against?

12 answers

A data warehouse is not necessarily a relational data store, although you may choose to make it one, so relational definitions don't necessarily apply.

A primary key is only required if you want to do something with the data that requires a unique identifier (like tracing it to a source, which is not always necessary or even possible anyway), and data in a data warehouse can often be used in ways that don't require primary keys. Specifically, you may not need to distinguish rows from each other, most often when you are constructing aggregate values.

Time is not a required dimension in constructing data warehouse tables.

It may be psychologically uncomfortable, and wasted space is a trivial issue, but your colleague is correct: PKs aren't necessary.

You should at least have a natural key on the fact table so you can identify rows and reconcile them against the source or track changes where this is necessary.
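As a rough illustration of that reconciliation, here is a minimal sketch that compares source rows and fact rows by a natural key. The column names (`order_no`, `line_no`) and the sample rows are made up for the example; a real load would read both sides from the databases.

```python
from collections import Counter

def reconcile(source_rows, fact_rows, key_cols):
    """Compare two row sets by natural key.

    Returns three sorted lists of keys: missing from the warehouse,
    present only in the warehouse, and loaded more often than in the source.
    """
    def key(row):
        return tuple(row[c] for c in key_cols)

    src = Counter(key(r) for r in source_rows)
    dw = Counter(key(r) for r in fact_rows)
    missing = sorted(k for k in src if k not in dw)
    extra = sorted(k for k in dw if k not in src)
    duplicated = sorted(k for k in dw if k in src and dw[k] > src[k])
    return missing, extra, duplicated

# Hypothetical sample data: (order_no, line_no) is the assumed natural key.
source = [
    {"order_no": 1001, "line_no": 1, "amount": 25.0},
    {"order_no": 1001, "line_no": 2, "amount": 10.0},
    {"order_no": 1002, "line_no": 1, "amount": 99.0},
]
facts = [
    {"order_no": 1001, "line_no": 1, "amount": 25.0},
    {"order_no": 1001, "line_no": 1, "amount": 25.0},  # loaded twice by a faulty run
    {"order_no": 1003, "line_no": 1, "amount": 5.0},   # not present in the source
]

missing, extra, duplicated = reconcile(source, facts, ["order_no", "line_no"])
print("missing:", missing)
print("extra:", extra)
print("duplicated:", duplicated)
```

Without some key to compare on, none of the three lists could be computed; only totals could be checked.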

On SQL Server an identity column gives you a surrogate key for free, and on other systems that use sequences (e.g. Oracle) one can be added fairly easily. Surrogate fact table keys can be useful for various reasons. Some possible applications are:

Some tools like to have numeric keys on fact tables, preferably monotonically increasing ones. An example of this is MS SQL Server Analysis Services, which really likes to have a numeric, monotonically increasing key for fact tables used to populate measure groups. This is especially required for incremental loads.
If you have any relationships between fact tables (for example a written - earned premium breakdown, for those familiar with insurance) then a synthetic key is helpful here.
If you have dimensions living in an M:M relationship with a fact table (e.g. ICD codes) then a numeric key on the fact table simplifies this.
If you have any self-join requirements for transactions (e.g. certain transactions being corrections to others) then a synthetic key will simplify working with these.
If you do contra-restate operations within your data warehouse (i.e. handle changes to transactional data by generating reversals and re-stating the row) then you can have multiple fact table rows for the same natural key, and a synthetic key is what tells them apart.

Otherwise, if you won't have anything joining to your fact table in a 1:M relationship then a synthetic key probably won't be used for anything.
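To make the contra-restate and self-join cases concrete, here is a small sketch using Python's built-in `sqlite3` module. The table `fact_premium`, its columns and the amounts are invented for illustration, and SQLite's `INTEGER PRIMARY KEY` stands in for an identity column or sequence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_premium (
        fact_id          INTEGER PRIMARY KEY,   -- surrogate key, assigned by SQLite
        policy_no        TEXT NOT NULL,         -- natural key from the source (assumed)
        amount           REAL NOT NULL,
        corrects_fact_id INTEGER REFERENCES fact_premium(fact_id)
    )
""")

# Original row as first loaded.
cur = conn.execute(
    "INSERT INTO fact_premium (policy_no, amount) VALUES ('P-1', 100.0)"
)
original_id = cur.lastrowid

# The source later changes the amount to 120: write a reversal and a restatement.
conn.executemany(
    "INSERT INTO fact_premium (policy_no, amount, corrects_fact_id) VALUES (?, ?, ?)",
    [("P-1", -100.0, original_id), ("P-1", 120.0, original_id)],
)

# Three rows now share the same natural key; only fact_id tells them apart.
rows_for_key = conn.execute(
    "SELECT COUNT(*) FROM fact_premium WHERE policy_no = 'P-1'"
).fetchone()[0]

# Aggregates still give the restated value.
net_amount = conn.execute(
    "SELECT SUM(amount) FROM fact_premium WHERE policy_no = 'P-1'"
).fetchone()[0]

# Self-join: list the rows that correct the original row.
corrections = conn.execute(
    """
    SELECT c.fact_id, c.amount
    FROM fact_premium AS o
    JOIN fact_premium AS c ON c.corrects_fact_id = o.fact_id
    WHERE o.fact_id = ?
    ORDER BY c.fact_id
    """,
    (original_id,),
).fetchall()

print(rows_for_key, net_amount, corrections)
```

The self-join is only expressible because each row has its own identifier to point at.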

An identity-type column is a "surrogate" key that replaces one of your "candidate" keys (simply put). Adding a surrogate key column adds nothing if you can't identify a row without it, and identifying a row without it requires a candidate key.

I always think that a table should be ordered to suit its most common queries or its worst performance offenders, so the clustered index of a table should be in line with the most difficult or most common query.

The primary key does not have to be a clustered index, so I know you might be wondering where I am going with this, but my concern is more about the clustered index than the primary key (and let's be honest, they normally go together).

So the initial question for me is not "should I have a surrogate primary key on the fact table?" but more like "should I have a clustered index on the fact table?" I think the answer is yes, you should have one (and yes, there are other posts on this site covering this question, but I still think it's worth mentioning here in case this is the question people are really asking despite wording it differently).

There are times you want a surrogate key, but I would heartily recommend that the surrogate is NOT the table's clustered index. Doing so would order the table in line with the meaningless surrogate key. (Often people add a surrogate identity column to a table and make it the primary key, and it also becomes the clustered index by default.)

So which columns should the clustered index be on? Personally I like date for fact tables. To this you might add some other dimension's FK for uniqueness, but this will increase the size and possibly not provide any benefit, because for the index to be useful the relevant dimensions would have to be referenced in queries (in the order of importance that the key was built with).

To get around this (and this is the reason I am answering here) I think you SHOULD add a surrogate and then create the clustered index on the date key followed by the surrogate (in that order). I do this because the date alone is not going to make a row unique, but adding the surrogate will. This keeps the data ordered by date, which helps all other non-clustered indexes, and also keeps the clustered index size reasonable.
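A minimal sketch of that layout, with loud assumptions: it uses SQLite through Python's `sqlite3` module rather than SQL Server, and relies on the fact that a SQLite `WITHOUT ROWID` table is stored as a clustered index on its primary key. The table name `fact_sales`, its columns and the rows are hypothetical, and the surrogate is generated in Python because `WITHOUT ROWID` tables do not auto-assign integer keys.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")

# WITHOUT ROWID makes the primary key the clustered index in SQLite,
# so rows are stored in (date_key, fact_id) order.
conn.execute("""
    CREATE TABLE fact_sales (
        date_key    INTEGER NOT NULL,   -- e.g. 20090530, FK to a date dimension
        fact_id     INTEGER NOT NULL,   -- surrogate, only here to make the key unique
        product_key INTEGER NOT NULL,
        amount      REAL    NOT NULL,
        PRIMARY KEY (date_key, fact_id)
    ) WITHOUT ROWID
""")

next_id = itertools.count(1)  # stand-in for an identity column or sequence

# Two loads are identical in every business column; the surrogate keeps both.
loads = [
    (20090530, 7, 19.99),
    (20090528, 3, 5.00),
    (20090530, 7, 19.99),
    (20090529, 4, 12.50),
]
conn.executemany(
    "INSERT INTO fact_sales (date_key, fact_id, product_key, amount) VALUES (?, ?, ?, ?)",
    [(d, next(next_id), p, a) for d, p, a in loads],
)

# A date-range query reads a contiguous slice of the clustered index.
rows = conn.execute(
    """
    SELECT date_key, fact_id
    FROM fact_sales
    WHERE date_key BETWEEN 20090529 AND 20090530
    ORDER BY date_key, fact_id
    """
).fetchall()
print(rows)
```

On SQL Server the equivalent would be a clustered index declared on the same two columns in the same order; the exact DDL is left out here rather than guessed at.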

Additionally, as the data grows you may want to partition it, in which case you will need a partition key, which will almost invariably be the date. Building the clustered index with date as the leading part of the key makes this easier. With partitioning you can then use the sliding window technique to archive old data or to load new data.

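A database table without a primary key seems like a poor design choice and leaves plenty of room for various kinds of anomalies; for example, how would you delete or update a single record in such a table?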

You are correct, sort of. Without a primary key, a table does not meet the minimal definition of being relational. It's fundamental to being a relation that it must not permit duplicate rows. Tables in a data warehouse design should be relational, even if they're not strictly in normal form.

So there must be some column (or set of columns) in the table that serves to identify rows uniquely. But it doesn't necessarily have to be an identity column for a surrogate key.

If the fact table has no set of columns that can serve this role of being a candidate key, then more dimension tables are needed in this DW, and more columns are needed in the fact table.

This new dimension alone may not be the primary key; it may be combined with existing columns in the fact table to create a candidate key.

I would agree with you.

"I was told that there is no set of columns in the table that would uniquely identify a record, even if all the columns were selected." - this seems to break something fundamental about relational databases as I understand them.

A fact consists of additive values plus foreign keys to dimensions. Time is an obvious dimension that is common to every dimensional model that I know. If nothing else, a composite key that contains a timestamp would certainly be unique enough.

I wonder if your DBAs have much knowledge about dimensional modeling. It's a different way of thinking from the normal relational, transactional style.

If the fact table is at the center of a star schema, then there is in reality a candidate key. If you take all the foreign keys in the fact table together (the ones that point to rows in the dimension tables), that's a candidate key.

It probably would not do much good to declare it as a primary key. The only thing it would do is protect you against a rogue ETL process. The folks who run the warehouse might have the ETL processing well in hand.

As far as indexing and query speed are concerned, that's a whole different issue with star schemas than it is with OLTP-oriented databases. The people who run the warehouse may have that in hand as well.

When designing a database for OLTP use, it's unwise to have a table without a primary key. The same considerations don't carry over into warehouses.

Using the combination of dimension surrogate keys as the primary key of the fact table doesn't work in all cases. Consider the case where there are three dimensions a, b and c. In most designs we usually have a dimension row for the "unknown"; assume I always assign this row the surrogate key of -1. I could easily have two rows in my fact table that have keys a=n1, b=n2 and c=-1, i.e. duplicate keys, because the two rows do not have valid values for dimension c and so both resolve to the unknown row.
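The collision described above is easy to reproduce. This is a small sketch with Python's `sqlite3` module; the table `fact` and the key values are made up, and -1 is the assumed surrogate key of the "unknown" dimension row.

```python
import sqlite3

UNKNOWN = -1  # assumed surrogate key of the "unknown" row in dimension c

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact (
        a_key  INTEGER NOT NULL,
        b_key  INTEGER NOT NULL,
        c_key  INTEGER NOT NULL,
        amount REAL    NOT NULL,
        PRIMARY KEY (a_key, b_key, c_key)   -- all dimension keys as the PK
    )
""")

# First source record: dimension c could not be resolved.
conn.execute("INSERT INTO fact VALUES (1, 2, ?, 10.0)", (UNKNOWN,))

# A second, genuinely different source record with the same a and b,
# whose c value is also unresolved, collides with the first.
try:
    conn.execute("INSERT INTO fact VALUES (1, 2, ?, 15.0)", (UNKNOWN,))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
print("second row rejected:", rejected, "rows stored:", row_count)
```

The second fact is lost unless the key includes something else, such as a source transaction number or a surrogate.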

You're conflating two issues here: identifying a unique record in the fact table, and tracing records from the source system through to the fact table.

In the latter case it's quite possible for a single record in a source system to have multiple fact table records. Imagine a source system record that represents a transfer of funds from one account to another. There might be two fact table records to represent this, one for the debited account and one for the credited account. Furthermore, there might be multiple fact records to represent different states of the source system record at different points in its lifecycle.

For the issue of the primary key on the fact table, there's really not a "correct" answer. There are desirable or essential characteristics that you might want (for example, for the identity of a single record to be communicated easily between users of the system, or for a single record to be deleted or updated easily). However, for an Oracle system a ROWID might very well do for that, as long as it doesn't matter if it occasionally changes.

Really though, there's so little overhead in maintaining a single synthetic key that you might as well do it anyway. You might choose not to index it, as the index is going to be a much larger resource consumer than the column itself.

Not having a unique identifier for each row is even worse than it first seems. Sure, it is precarious and it's easy to inadvertently delete some rows.

But performance is much worse too. Each time you end up asking the database to get you the rows for Employees with EmployeeType = 'Manager' you are doing a string comparison. Identifiers are just faster and better.

Besides, storage is cheap, and in this case I imagine the impact on space will be less than a quarter of a percentage point, if that; as a data warehouse you are probably designing for terabytes of data.

http://www.ralphkimball.com/html/controversies.html

Fable:

The primary key of a fact table consists of all the referenced dimension foreign keys.

Fact:

A fact table often has 10 or more foreign keys joining to the dimension tables' primary keys. However, only a subset of the fact table's foreign key references is typically needed for row uniqueness. Most fact tables have a primary key that consists of a concatenated/composite subset of the foreign keys.
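One way to explore which subset of foreign keys is enough for uniqueness is to test column combinations against a sample of rows. The sketch below does that by brute force; the function name, column names and sample rows are all hypothetical. Uniqueness in a sample only suggests a candidate key, since the real key is a statement about the business process and not about the data that happens to be loaded.

```python
from itertools import combinations

def minimal_unique_subsets(rows, candidate_cols):
    """Return the smallest column combinations whose values are unique across rows.

    Tries combinations of increasing size and stops at the first size
    for which at least one combination identifies every row.
    """
    for size in range(1, len(candidate_cols) + 1):
        found = [
            cols
            for cols in combinations(candidate_cols, size)
            if len({tuple(r[c] for c in cols) for r in rows}) == len(rows)
        ]
        if found:
            return found
    return []

# Hypothetical sample of a fact table with four dimension foreign keys.
fk_cols = ["date_key", "store_key", "product_key", "promo_key"]
sample = [
    {"date_key": 1, "store_key": 10, "product_key": 100, "promo_key": 5},
    {"date_key": 1, "store_key": 10, "product_key": 101, "promo_key": 5},
    {"date_key": 1, "store_key": 11, "product_key": 100, "promo_key": 5},
    {"date_key": 2, "store_key": 10, "product_key": 100, "promo_key": 5},
]

subsets = minimal_unique_subsets(sample, fk_cols)
print(subsets)
```

In this sample three of the four foreign keys are enough, and `promo_key` adds nothing to row identity.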