

How to design and handle exponential growth in fact table?

Original post: 2016-08-17 19:46:52. Tags: sql-server, data-warehouse, tableau, star-schema

Here is my scenario with a SQL Server 2008 R2 database table.

(Update: Migration to SQL Server 2014 SP1 is in progress, so SQL Server 2014 can be used here.)

A. Maintain daily history in the table (which is a fact table)
B. Create Tableau graphs using the fact and dimension tables

A few steps to follow to create the table:

A copy of the table from the source database will be pushed to my SQL Server daily. It contains approximately 120,000 to 130,000 rows with about 20 columns.

a. 1st day: we get 120,000 records. The sample structure is below.

(Modified or new records are highlighted in yellow.)

Source System Data:

b. 2nd day: we get, say, 122,000 records (2,000 are newly inserted, 1,000 are modifications of the previous day's data, and 119,000 are unchanged from the previous day).

c. 3rd day: we get, say, 123,000 records (1,000 are newly inserted, 1,000 are modifications of the 2nd day's data, and 121,000 are unchanged from the 2nd day).

Since the daily history has to be maintained in the fact table, within a week the table will have about 1 million rows,

for 2 weeks - 2 million rows

for 1 month - 5 million rows

for 1 year - say 65 - 70 million rows

for 12 years - say 1 billion rows (1,000 million)

12 years of history have to be maintained.
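As a rough sanity check on these estimates, the sketch below projects row counts for the two loading strategies in question: appending the full daily copy, and appending only new and modified rows. The constants are assumptions taken from the sample days above (a flat 125,000 rows per daily copy and 3,000 changed rows per day). Real volumes will drift, so the output is only an order-of-magnitude guide.

```python
# Rough row-count projection for two loading strategies.
# Assumptions (illustrative, not measured): constant daily volume and a
# constant number of new/modified rows per day.

FULL_SNAPSHOT_ROWS_PER_DAY = 125_000  # midpoint of 120,000 to 130,000
DELTA_ROWS_PER_DAY = 3_000            # 2,000 new + 1,000 modified, as on day 2
INITIAL_LOAD = 120_000                # full load on day 1


def full_snapshot_rows(days: int) -> int:
    """Rows stored when every daily copy is appended in full."""
    return FULL_SNAPSHOT_ROWS_PER_DAY * days


def delta_rows(days: int) -> int:
    """Rows stored when only new and modified rows are appended after day 1."""
    if days <= 0:
        return 0
    return INITIAL_LOAD + DELTA_ROWS_PER_DAY * (days - 1)


if __name__ == "__main__":
    horizons = [("1 week", 7), ("1 month", 30), ("1 year", 365), ("12 years", 12 * 365)]
    for label, days in horizons:
        print(f"{label:>8}: full={full_snapshot_rows(days):>13,}  delta={delta_rows(days):>11,}")
```

Under these assumptions the full-copy approach reaches hundreds of millions of rows over 12 years, while the delta approach stays in the low tens of millions. Note that growth is linear in the number of days in both cases.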

What would be the right strategy for storing data in the table to handle this scenario, while still providing sufficient performance when generating reports?

Partition the table by month (each monthly partition would contain approx. 5 million rows)?
I thought of copying only the differential data into the table daily (new and modified rows only), but it is not possible to create Tableau reports with Approach-2.

Fact Table Approaches:

Tableau graphs have to be created using the fact and dimension tables for scenarios like:

Weekly bar graph of sample count
Weekly plot of average sample values (week no. on X-axis, average value on Y-axis)
Weekly average sample values by quality (week no. on X-axis, average value on Y-axis)
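To make the three chart scenarios concrete, here is a minimal sketch of the aggregations they imply: sample count per week, average value per week, and average value per week and quality. The row layout (`snapshot_date`, `quality`, `sample_value`) and the data are hypothetical, since the real schema is not shown. ISO week numbering is also an assumption; Tableau or the date dimension may define weeks differently.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical fact rows: (snapshot_date, quality, sample_value).
FACT_ROWS = [
    (date(2016, 8, 1), "A", 10.0),
    (date(2016, 8, 2), "A", 14.0),
    (date(2016, 8, 3), "B", 21.0),
    (date(2016, 8, 8), "A", 30.0),
    (date(2016, 8, 9), "B", 50.0),
]


def weekly_summary(rows):
    """Return {(iso_year, iso_week): (sample_count, average_value)}."""
    buckets = defaultdict(list)
    for snapshot_date, _quality, value in rows:
        iso_year, iso_week, _weekday = snapshot_date.isocalendar()
        buckets[(iso_year, iso_week)].append(value)
    return {week: (len(values), mean(values)) for week, values in sorted(buckets.items())}


def weekly_average_by_quality(rows):
    """Return {(iso_year, iso_week, quality): average_value}."""
    buckets = defaultdict(list)
    for snapshot_date, quality, value in rows:
        iso_year, iso_week, _weekday = snapshot_date.isocalendar()
        buckets[(iso_year, iso_week, quality)].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


if __name__ == "__main__":
    for week, (count, average) in weekly_summary(FACT_ROWS).items():
        print(week, count, average)
    for key, average in weekly_average_by_quality(FACT_ROWS).items():
        print(key, average)
```

In a warehouse the same grouping would normally be done in SQL or pushed down by Tableau, ideally against a week attribute on the date dimension rather than computed per row.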

How to handle this scenario?

Please provide references on the approach to follow.

Should we create any indexes on the fact table?

2 Answers

A data warehouse can handle millions of rows these days without much difficulty. Many have tens of billions of rows, and that is where things get a little difficult. You should look at table partitioning over time, and at columnstore compression and page compression, to see what is available. Large warehouses often use both partitioning and compression. 2008 R2 is quite old at this point, and huge progress has been made in this area in current versions of SQL Server.

Use a standard fact-dimensional design, and try to avoid tweaking the actual schema with workarounds just to conserve space - that will generally bite you in the long run.

For proven, time-tested warehouse designs I like the Kimball Group's patterns, e.g. The Data Warehouse Lifecycle Toolkit book.

There are a few different requirements in your case. Because of that, I suggest splitting the requirements according to the standard data warehouse three-tier model.

DWH model (delta-driven, historized, high performance)
Presentation model (again high performance, should fit Tableau)
Front end

DWH model

Basically, you have several different approaches here, all with their pros and cons.

3NF

Can become cumbersome down the road. Is highly flexible if used right. Time-to-market is long (depending on complexity). Historization can become complicated.

Star Schema (for DWH storage!)

Has a very, very fast time-to-market. Becomes extremely complicated to maintain when business rules or the business structure change. Helpful for a very small business, but not for businesses that want to expand their Business Intelligence infrastructure. Historization can become a mess if the star schema is the main DWH model.

Data Vault

Has a medium time-to-market. Is easier to understand than 3NF but can be puzzling for people used to a star schema. Automatically historized, parallelizable, and very flexible for changing business needs, because the business rules are implemented downstream. Scales quickly.

Anchor Modelling

Another highly flexible approach, which I haven't used yet. It is in some ways the same approach as Data Vault, but with some differences.

Presentation model

Now, to present the never-touched-again data from the DWH layer, nothing fits better than a star schema. Also, while creating the star schema, you can implement business logic.

Front end

Shouldn't matter, take the tool you like.

In your case, it would be smart to implement a DWH (using one of those models) and put the presentation model on top of it. If there are any problems in the star schema, you can always re-generate it with the new changes.
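A minimal sketch of that idea, assuming the DWH layer stores each record version once with a validity range and the daily picture needed by the presentation layer is rebuilt on demand. The structure (`key`, `attrs`, `valid_from`, `valid_to`) and the function names are hypothetical and not tied to any particular modelling method. Deleted source rows are deliberately not handled here.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # conventional marker for "still current"


def apply_daily_copy(history, daily_copy, load_date):
    """Merge one full daily copy into a versioned history (delta-driven).

    history:    list of dicts with keys key, attrs, valid_from, valid_to
    daily_copy: dict mapping business key -> attribute dict for that day
    Only new or changed keys add a row; unchanged rows are not repeated.
    """
    current = {row["key"]: row for row in history if row["valid_to"] == OPEN_END}
    for key, attrs in daily_copy.items():
        existing = current.get(key)
        if existing is not None and existing["attrs"] == attrs:
            continue  # unchanged since the previous load
        if existing is not None:
            existing["valid_to"] = load_date  # close the superseded version
        history.append(
            {"key": key, "attrs": attrs, "valid_from": load_date, "valid_to": OPEN_END}
        )
    return history


def snapshot_as_of(history, as_of):
    """Rebuild the full picture for one date (half-open validity ranges)."""
    return {
        row["key"]: row["attrs"]
        for row in history
        if row["valid_from"] <= as_of < row["valid_to"]
    }


if __name__ == "__main__":
    history = []
    apply_daily_copy(history, {1: {"value": 10}, 2: {"value": 20}}, date(2016, 8, 1))
    apply_daily_copy(
        history, {1: {"value": 10}, 2: {"value": 25}, 3: {"value": 30}}, date(2016, 8, 2)
    )
    print(len(history))
    print(snapshot_as_of(history, date(2016, 8, 1)))
    print(snapshot_as_of(history, date(2016, 8, 2)))
```

Because the history keeps every version, a daily snapshot fact for the star schema can be regenerated for any date without storing the unchanged rows again each day.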

NOTE: If you use a star schema as the DWH model, you cannot re-create the star schema in the presentation layer without using some complex transformation logic to begin with.

NOTE: Also, sometimes the star schema itself is seen as the DWH. I don't think that is a good use of it for any requirement that could become more complex.

EDIT

To clarify my last note, see this blog post: http://www.tobiasmaasland.de/2016/08/24/why-your-data-warehouse-is-not-a-data-warehouse/