
How to index a table with a Type 2 slowly changing dimension for optimal performance

Suppose you have a table with a Type 2 slowly-changing dimension.

Let's say the table has the following columns:

* [Key]
* [Value1]
* ...
* [ValueN]
* [StartDate]
* [ExpiryDate]

In this example, let's suppose that [StartDate] is effectively the date on which the values for a given [Key] become known to the system. So our primary key would be composed of both [StartDate] and [Key].
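For concreteness, the layout described above could be declared roughly as follows. This is only a sketch in T-SQL: the table name `dbo.DimExample`, the constraint name, and the column types are assumptions, and two value columns stand in for [Value1] through [ValueN].

```sql
-- Hypothetical table; all names and types are assumptions for illustration.
CREATE TABLE dbo.DimExample
(
    [Key]        int         NOT NULL,
    [Value1]     varchar(50) NULL,
    [ValueN]     varchar(50) NULL,      -- stands in for the remaining value columns
    [StartDate]  datetime    NOT NULL,  -- when these values became known
    [ExpiryDate] datetime    NOT NULL,  -- '99991231' marks the current row
    -- NONCLUSTERED keeps the choice of clustered index open.
    CONSTRAINT PK_DimExample PRIMARY KEY NONCLUSTERED ([StartDate], [Key])
);
```

Declaring the primary key as nonclustered is a choice made here so that the clustered index can be picked separately; it is not something the question specifies.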

When a new set of values arrives for a given [Key], we set its [ExpiryDate] to some pre-defined high surrogate value such as '12/31/9999'. We then update the existing "most recent" record for that [Key] so that its [ExpiryDate] equals the [StartDate] of the new record. A simple update based on a join.
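A minimal sketch of that load step, assuming a hypothetical dimension table `dbo.DimExample` with the columns listed above and a hypothetical staging table `dbo.Incoming` holding one new row per key:

```sql
-- Assumes hypothetical tables dbo.DimExample and dbo.Incoming;
-- dbo.Incoming has columns [Key], [Value1], [ValueN], [StartDate].
BEGIN TRANSACTION;

-- Close the current row of every key that has a new version arriving.
UPDATE d
SET    d.[ExpiryDate] = i.[StartDate]
FROM   dbo.DimExample AS d
JOIN   dbo.Incoming   AS i ON i.[Key] = d.[Key]
WHERE  d.[ExpiryDate] = '99991231';

-- Add the new versions as the current rows.
INSERT INTO dbo.DimExample ([Key], [Value1], [ValueN], [StartDate], [ExpiryDate])
SELECT i.[Key], i.[Value1], i.[ValueN], i.[StartDate], '99991231'
FROM   dbo.Incoming AS i;

COMMIT TRANSACTION;
```

Running the update before the insert matters: otherwise the freshly inserted rows would also match the `'99991231'` filter and be expired immediately.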


So if we always wanted to get the most recent record for a given [Key], we know we could create a clustered index on:

* [ExpiryDate] ASC
* [Key] ASC

Although the keyspace may be very wide (say, a million keys), we can minimize the number of pages between reads by ordering the rows by [ExpiryDate] first. And since we know the most recent record for a given key will always have an [ExpiryDate] of '12/31/9999', we can use that to our advantage.
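As a rough illustration, the index and the "current record" lookup might look like this in T-SQL. The table name `dbo.DimExample`, the index name, and the value columns are assumptions, and the sketch presumes the table has no clustered index yet.

```sql
-- Hypothetical names; assumes dbo.DimExample has no clustered index yet.
CREATE CLUSTERED INDEX CIX_DimExample_Expiry_Key
    ON dbo.DimExample ([ExpiryDate] ASC, [Key] ASC);

-- Current row for one key: equality predicates on both index columns.
SELECT [Key], [Value1], [ValueN], [StartDate]
FROM   dbo.DimExample
WHERE  [ExpiryDate] = '99991231'
  AND  [Key] = 112;
```

One trade-off to keep in mind: because [ExpiryDate] is part of the clustering key, expiring a record changes that key, so the row has to move within the index.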

However... what if we want to get a point-in-time snapshot of all [Key]s at a given time? In practice, the whole keyspace is not updated at the same moment. Therefore, for a given point in time, the window between [StartDate] and [ExpiryDate] varies from key to key, so ordering by either [StartDate] or [ExpiryDate] would never yield a result in which all the records you're looking for are contiguous. Granted, you can immediately throw out all records in which the [StartDate] is greater than your chosen point in time.
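To make the access pattern concrete, a point-in-time snapshot is a query of the following shape. This is a sketch against a hypothetical table `dbo.DimExample`; the variable name and the sample date are assumptions.

```sql
-- Hypothetical snapshot moment.
DECLARE @AsOf datetime;
SET @AsOf = '20150101';

-- One row per key: the version whose validity window contains @AsOf.
SELECT [Key], [Value1], [ValueN], [StartDate], [ExpiryDate]
FROM   dbo.DimExample
WHERE  [StartDate]  <= @AsOf
  AND  [ExpiryDate] >  @AsOf;
```

The window is treated as half-open (start inclusive, expiry exclusive) because an expired record's [ExpiryDate] equals the [StartDate] of its successor; using `>=` on both sides would return two rows for a key at the exact moment of a change.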


In essence, in a typical RDBMS, which indexing strategy best minimizes the number of reads needed to retrieve the values for all keys at a given point in time? I realize I can at least reduce IO by partitioning the table by [Key], but this certainly isn't ideal.

Alternatively, is there a different type of slowly-changing dimension that solves this problem in a more performant manner?

Lazy DBA

Are you talking about bringing back all the values in your dimension table? If so, then why not add a non-clustered index with additional coverage, so that you're only pulling values out of the index itself rather than from the table? That way you're scanning a B-tree with some attached "covered" values, as opposed to potentially performing a table scan. I can't vouch for relative performance, but it's worth testing for the scenario you're obviously working on.

Cheers

Ozziemedes http://ozziemedes.blogspot.com/
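A covering index along those lines might be declared as below. This is a sketch only: the table name `dbo.DimExample`, the index name, the choice of key columns, and the included value columns are all assumptions that would need testing against the real workload.

```sql
-- Hypothetical covering index: the date columns form the index key,
-- and the remaining columns ride along at the leaf level via INCLUDE.
CREATE NONCLUSTERED INDEX IX_DimExample_Start_Expiry
    ON dbo.DimExample ([StartDate], [ExpiryDate])
    INCLUDE ([Key], [Value1], [ValueN]);
```

With every requested column present in the index, a snapshot query filtering on [StartDate] and [ExpiryDate] can be answered from the index alone, without lookups into the base table.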

If this is truly a "slowly changing dimension" table, I would consider a clustered columnstore index. I know this wasn't available when you asked the question, but anyway. You'll find some good documentation at https://msdn.microsoft.com/en-us/library/gg492088.aspx and at http://www.nikoport.com/2013/07/05/clustered-columnstore-indexes-part-1-intro/.
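For reference, creating such an index is a single statement. The sketch below assumes a hypothetical table `dbo.DimExample` stored as a heap, and the index name is made up. Clustered columnstore indexes require SQL Server 2014 or later, and on 2014 the table cannot also carry rowstore indexes or a primary key constraint, so those would have to be dropped first.

```sql
-- Hypothetical names; assumes dbo.DimExample has no clustered index
-- and nothing else that conflicts on the SQL Server version in use.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_DimExample
    ON dbo.DimExample;
```

No column list is given because a clustered columnstore index always covers every column of the table.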

Now, if you want to stick to rowstore indexes and you're inserting the data into the table sequentially, what I've done in the past is leverage an identity field. Your queries would be something like:

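    -- [table], [fields], [key] and [date] are placeholders for your own names.
    declare @id int;
    -- lowest identity value among the current (unexpired) rows
    select @id = min(ID) from [table] where [date] = '12/31/9999';
    select [fields] from [table] where [key] = 112 and ID > @id;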
