
SQL/SSIS Data Warehouse fact table loading, best practices?

Original post: 2012-11-01 01:04:18 | Tags: sql, sql-server, ssis, lookup, data-warehouse

I am building my first data warehouse in SQL Server 2008/SSIS and I am looking for some best practices around loading the fact tables.

Currently in my DW I have about 20 dimensions (Offices, Employees, Products, Customers, etc.) that are Type 1 SCDs. In my DW structure, there are a few things I have already applied:

No NULLs (replaced with blank for text or 0 for numeric during staging)
Unknown key members populated in each dimension (SK ID 0)
UPSERT for SCD Type 1 loading from the stage table to the production table
SELECT DISTINCT when loading the dimensions
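
As a rough illustration of the dimension-loading practices listed above, here is a minimal sketch of a Type 1 upsert from a staging table into a dimension that carries an unknown member at surrogate key 0. It runs on SQLite through Python's `sqlite3` module purely so that it can be executed anywhere; it is not SQL Server syntax (there, a `MERGE` statement is another option). The table and column names (`stg_office`, `dim_office`, `office_code`, `office_name`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg_office (office_code TEXT, office_name TEXT);
CREATE TABLE dim_office (
    office_sk   INTEGER PRIMARY KEY,   -- surrogate key
    office_code TEXT UNIQUE,           -- business key
    office_name TEXT NOT NULL
);
-- Unknown member with surrogate key 0, plus one existing row.
INSERT INTO dim_office VALUES (0, 'N/A', 'Unknown');
INSERT INTO dim_office VALUES (1, 'NYC', 'New York');
-- Staging holds one changed row and one new row that arrives twice.
INSERT INTO stg_office VALUES ('NYC', 'New York City');
INSERT INTO stg_office VALUES ('LON', 'London');
INSERT INTO stg_office VALUES ('LON', 'London');
""")

# Step 1 (Type 1 update): overwrite changed attributes in place, keeping no history.
conn.execute("""
    UPDATE dim_office
    SET office_name = (SELECT s.office_name
                       FROM stg_office s
                       WHERE s.office_code = dim_office.office_code)
    WHERE EXISTS (SELECT 1
                  FROM stg_office s
                  WHERE s.office_code = dim_office.office_code
                    AND s.office_name <> dim_office.office_name)
""")

# Step 2 (insert): add business keys not yet in the dimension.
# SELECT DISTINCT collapses the duplicate staging rows.
conn.execute("""
    INSERT INTO dim_office (office_code, office_name)
    SELECT DISTINCT s.office_code, s.office_name
    FROM stg_office s
    WHERE NOT EXISTS (SELECT 1
                      FROM dim_office d
                      WHERE d.office_code = s.office_code)
""")

rows = conn.execute(
    "SELECT office_sk, office_code, office_name FROM dim_office ORDER BY office_sk"
).fetchall()
print(rows)
```

The unknown member is untouched, the existing `NYC` row is overwritten, and `LON` is inserted once despite appearing twice in staging.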

In my fact-loading SSIS project, my current method for resolving the dimension keys is to run a lookup against each of the DIMs (20+ lookups in total), then populate the FACT table with the data.

For my lookups I set:

Full Cache
Ignore Failure for rows with no matching entries
Derived Column transformation with "ISNULL(surrogate_idkey) ? 0 : surrogate_idkey" for each SK, so that if a lookup fails the key defaults to SK ID 0 (the unknown member).
Some of my dimension lookups have more than one business key
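
To make the intended behaviour of these settings concrete, here is a small Python sketch of what a full-cache lookup followed by the ISNULL default amounts to. It is only a model of the logic, not SSIS code. The cache contents, the `lookup_sk` helper, and the column names are all hypothetical; the tuple key stands in for a lookup that matches on more than one business key.

```python
from typing import Dict, Tuple

UNKNOWN_SK = 0  # surrogate key of the unknown member

# "Full cache": the whole dimension is read into memory once, before any rows flow.
# The key is the business key; a tuple covers lookups with more than one business key.
product_cache: Dict[Tuple[str, str], int] = {
    ("ACME", "P-100"): 11,
    ("ACME", "P-200"): 12,
}

def lookup_sk(cache: Dict[Tuple[str, str], int], business_key: Tuple[str, str]) -> int:
    """Return the surrogate key, or the unknown member's key when nothing matches."""
    sk = cache.get(business_key)  # None models a no-match row whose failure is ignored
    return UNKNOWN_SK if sk is None else sk  # same idea as ISNULL(sk) ? 0 : sk

incoming_rows = [
    {"vendor": "ACME", "product_code": "P-100", "qty": 3},
    {"vendor": "ACME", "product_code": "P-999", "qty": 1},  # not in the dimension
]

loaded = [
    {"product_sk": lookup_sk(product_cache, (r["vendor"], r["product_code"])),
     "qty": r["qty"]}
    for r in incoming_rows
]
print(loaded)
```

The second row has no match in the cache, so it lands on surrogate key 0 instead of failing the load.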

Is this the best approach? Pictures attached to help with my description above.

[image not preserved]

1 Answer

Looks fine. There are options if you start to run into performance issues, but if this is stable (it finishes within the data-loading time window, the source systems aren't being drained of resources, etc.), then I see no reason to change.

Some potential issues to keep an eye on...

Having 20+ full-cache lookup transforms may pose a problem if your dimensions increase in size, due to memory constraints on the SSIS system. But since they are Type 1, I wouldn't worry.
Full-cache lookups "hydrate" their caches before the data flow starts executing, so having 20+ of them may slow you down.

A common alternative (to what you have above) is to extract the fact table data from the source system and land it in a staging area before doing the dimension key lookups via a single SQL statement. Some even keep a set of dimension key mapping tables in the staging area specifically for this purpose. This reduces locking/blocking on the source system, which matters if each load carries a lot of data and you would otherwise have to block the source system while you pull the data out and run it through those 20+ lookup transforms.
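
A minimal sketch of that staging approach, assuming the fact rows have already been landed in a staging table: one `INSERT ... SELECT` resolves every dimension key with a `LEFT JOIN`, and `COALESCE` sends unmatched rows to the unknown member (SK 0). SQLite via Python's `sqlite3` is used here only to keep the example runnable; the same join pattern is ordinary SQL. All table and column names (`stg_sales`, `fact_sales`, `dim_office`, `dim_product`) are hypothetical, and a real load would join all 20+ dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_office  (office_sk  INTEGER PRIMARY KEY, office_code TEXT);
CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, vendor TEXT, product_code TEXT);
CREATE TABLE stg_sales   (office_code TEXT, vendor TEXT, product_code TEXT, amount REAL);
CREATE TABLE fact_sales  (office_sk INTEGER, product_sk INTEGER, amount REAL);

-- Each dimension carries an unknown member at surrogate key 0.
INSERT INTO dim_office  VALUES (0, 'N/A'), (1, 'NYC');
INSERT INTO dim_product VALUES (0, 'N/A', 'N/A'), (1, 'ACME', 'P-100');

-- Staged fact rows: one fully matched, one unknown product, one unknown office.
INSERT INTO stg_sales VALUES
    ('NYC', 'ACME', 'P-100', 10.0),
    ('NYC', 'ACME', 'P-999',  5.0),
    ('XXX', 'ACME', 'P-100',  7.5);
""")

# One set-based statement replaces the chain of lookup transforms.
conn.execute("""
    INSERT INTO fact_sales (office_sk, product_sk, amount)
    SELECT COALESCE(o.office_sk, 0),
           COALESCE(p.product_sk, 0),
           s.amount
    FROM stg_sales s
    LEFT JOIN dim_office  o ON o.office_code  = s.office_code
    LEFT JOIN dim_product p ON p.vendor       = s.vendor        -- two-part business key
                           AND p.product_code = s.product_code
""")

facts = conn.execute(
    "SELECT office_sk, product_sk, amount FROM fact_sales ORDER BY amount"
).fetchall()
print(facts)
```

Because the joins run against staging, the source system is only touched during the initial extract.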

Having a good staging area strategy becomes more important when you have a large amount of data, large dimensions, complex key mappings (usually due to multiple source systems), and short data-loading time windows.