
How to design and handle exponential growth in a fact table?

Here is my scenario, with a SQL Server 2008 R2 database table:

(Update: Migration to SQL Server 2014 SP1 is in progress, so SQL Server 2014 can be used here).

A. Maintain daily history in the table (which is a fact table)

B. Create Tableau graphs using the fact and dimension tables

A few steps to follow to create the table:

  1. A copy of the table from the source database is pushed to my SQL Server DAILY; it contains 120,000 to 130,000 rows and approximately 20 columns (a sketch of this daily load appears after this list).

a. 1st day, we get 120,000 records; the sample structure is shown below (modified or new records are highlighted in yellow).

[Figure: Source System Data]

b. 2nd day, we get, say, 122,000 records (2,000 newly inserted, 1,000 modified/updated versions of the previous day's rows, and 119,000 carried over unchanged from the previous day)

c. 3rd day, we get, say, 123,000 records (1,000 newly inserted, 1,000 modified/updated versions of the 2nd day's rows, and 121,000 carried over unchanged from the 2nd day)

  2. Since the daily history has to be maintained in the fact table, within a week the table will have 1 million rows,

for 2 weeks - 2 million rows

for 1 month - 5 million rows

for 1 year - say 65-70 million rows

for 12 years - say 1 billion rows (1,000 million)

  3. 12 years of history has to be maintained
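For reference, the daily load in step 1 can be as simple as stamping each incoming snapshot with its load date. A minimal sketch, assuming a staging copy of the source table; all table and column names here are hypothetical:

    -- Append today's full snapshot to the fact table, stamped with its date.
    INSERT INTO dbo.FactSample (SnapshotDate, SampleId, SampleValue)  -- plus the remaining columns
    SELECT CAST(GETDATE() AS date) AS SnapshotDate,
           s.SampleId,
           s.SampleValue
    FROM staging.SourceSample AS s;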

What would be the right strategy for storing data in this table to handle this scenario, one that also provides sufficient performance when generating reports?

  • Partition the table by month (each monthly partition would then hold roughly 5 million rows)? See the sketch after this list.
  • I also thought of copying only the differential data (new and modified rows) into the table daily, but it is not possible to create the Tableau reports with that delta-only approach.
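Regarding the first option, here is a minimal sketch of monthly partitioning on a snapshot-date column. All object names and boundary dates are hypothetical, and a real setup would extend the function each month (a sliding window):

    -- Monthly partition function and scheme (boundary dates are examples only).
    CREATE PARTITION FUNCTION pfMonthly (date)
        AS RANGE RIGHT FOR VALUES ('2016-01-01', '2016-02-01', '2016-03-01');

    CREATE PARTITION SCHEME psMonthly
        AS PARTITION pfMonthly ALL TO ([PRIMARY]);

    -- Hypothetical fact table partitioned on the daily snapshot date.
    CREATE TABLE dbo.FactSample (
        SnapshotDate date NOT NULL,
        SampleId     int  NOT NULL,
        SampleValue  decimal(18, 4) NULL
        -- ... the remaining ~20 columns ...
    ) ON psMonthly (SnapshotDate);

    -- Each month, add the next boundary:
    ALTER PARTITION FUNCTION pfMonthly () SPLIT RANGE ('2016-04-01');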

[Figure: Fact Table Approaches]

Tableau graphs have to be created using the fact and dimension tables for scenarios like the following (a sketch of the underlying weekly aggregation appears after this list):

  • Weekly bar graph of the sample count

  • Weekly plot (week no. on the X-axis) of average sample values (on the Y-axis)

  • Weekly (week no. on the X-axis) average sample values (on the Y-axis) by quality
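For graphs like these, the underlying query is a weekly aggregation. A minimal sketch against the hypothetical fact table above (DATEPART with ISO_WEEK is available from SQL Server 2008 on):

    -- Weekly sample count and average sample value; add Quality to the
    -- SELECT and GROUP BY for the per-quality variant.
    SELECT DATEPART(ISO_WEEK, f.SnapshotDate) AS WeekNo,
           COUNT(*)                           AS SampleCount,
           AVG(f.SampleValue)                 AS AvgSampleValue
    FROM dbo.FactSample AS f
    GROUP BY DATEPART(ISO_WEEK, f.SnapshotDate)
    ORDER BY WeekNo;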

How should this scenario be handled?

Please provide references on the approach to follow.

Should we create any indexes on the fact table?

A data warehouse can handle millions of rows these days without much difficulty. Many hold tens of billions of rows, and at that point things do get a little difficult. You should look at both table partitioning over time and at columnstore and page compression to see what is available. Large warehouses often use both. 2008 R2 is quite old at this point; note that huge progress has been made in this area in current versions of SQL Server.
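To make the compression options concrete, here is a minimal sketch, again assuming a hypothetical dbo.FactSample table. Treat the two statements as alternatives: a clustered columnstore index replaces the rowstore layout and brings its own compression.

    -- Option 1: page compression on a conventional rowstore fact table.
    ALTER TABLE dbo.FactSample REBUILD
        WITH (DATA_COMPRESSION = PAGE);

    -- Option 2 (SQL Server 2014 and later): a clustered columnstore index,
    -- well suited to large, scan-heavy fact tables.
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSample
        ON dbo.FactSample;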

Use a standard fact-dimensional design, and try to avoid tweaking the actual schema with workarounds just to conserve space - that generally will bite you in the long run.

For proven, time-tested designs in warehousing I like the Kimball Group's patterns, e.g. the book The Data Warehouse Lifecycle Toolkit.

There are a few different requirements in your case. Because of that, I suggest splitting the requirements according to the standard data warehouse three-tier model.

  • DWH model (delta-driven, historized, high performance)
  • Presentation model (Again, high performance, should fit Tableau)
  • Front end

DWH model

Basically, you have four different approaches here, all with their pros and cons.

  1. 3NF

Can become cumbersome down the road. Highly flexible if used right. Time-to-market is long (depending on complexity). Historization can become complicated.

  2. Star Schema (for DWH storage!)

Has a very, very fast time-to-market. Becomes extremely complicated to maintain when business rules or the business structure change. Helpful for a very small business, but not for businesses that want to expand their Business Intelligence infrastructure. Historization can become a mess if the star schema is the main DWH model.

  3. Data Vault

Has a medium time-to-market. Easier to understand than 3NF, but can be puzzling for people used to a star schema. Automatically historized, parallelizable, and very flexible under changing business needs, because the business rules are implemented downstream. Scales quickly. (A minimal sketch follows this list.)

  4. Anchor Modelling

Another highly flexible approach, which I haven't used yet. It is in some ways similar to Data Vault, but with some differences.
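To make the Data Vault idea concrete, here is a minimal hub-and-satellite sketch with hypothetical names. History falls out naturally, because every attribute change becomes a new satellite row:

    -- Hub: one row per business key, never updated.
    CREATE TABLE dbo.HubSample (
        HubSampleKey      bigint IDENTITY(1, 1) NOT NULL PRIMARY KEY,
        SampleBusinessKey int         NOT NULL UNIQUE,
        LoadDate          datetime2   NOT NULL,
        RecordSource      varchar(50) NOT NULL
    );

    -- Satellite: descriptive attributes, historized by load date.
    CREATE TABLE dbo.SatSample (
        HubSampleKey bigint         NOT NULL
            REFERENCES dbo.HubSample (HubSampleKey),
        LoadDate     datetime2      NOT NULL,
        SampleValue  decimal(18, 4) NULL,
        Quality      varchar(20)    NULL,
        PRIMARY KEY (HubSampleKey, LoadDate)
    );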

Presentation model

Now, to present the never-touched-again data in the DWH layer, nothing fits better than a star schema. Also, while creating the star schema, you can implement business logic.
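As a minimal sketch of such a presentation-layer star schema (all names hypothetical): a date dimension keyed in the usual yyyymmdd way, with the fact table referencing it.

    -- Date dimension; WeekNo feeds the weekly Tableau graphs directly.
    CREATE TABLE dbo.DimDate (
        DateKey  int  NOT NULL PRIMARY KEY,  -- e.g. 20160824
        FullDate date NOT NULL,
        WeekNo   int  NOT NULL,
        [Year]   int  NOT NULL
    );

    -- Fact table in the presentation layer, one row per sample per day.
    CREATE TABLE dbo.FactSampleDaily (
        DateKey     int NOT NULL REFERENCES dbo.DimDate (DateKey),
        SampleId    int NOT NULL,
        SampleValue decimal(18, 4) NULL,
        Quality     varchar(20)    NULL
    );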

Front end

Shouldn't matter, take the tool you like.

In your case, it would be smart to implement a DWH (using one of those models) and put the presentation model on top of it. If any problems arise in the star schema, you can always re-generate it with the new changes.

NOTE: If you used a star schema as the DWH model, you could not re-create the star schema in the presentation layer without some complex transformation logic to begin with.

NOTE: Also, the star schema is sometimes treated as the DWH itself. I don't think that is a good use of it for any requirement that could become more complex.

EDIT

To clarify my last note, see this blog post: http://www.tobiasmaasland.de/2016/08/24/why-your-data-warehouse-is-not-a-data-warehouse/
