Is there any performance impact from having NULLs in a foreign key column in a data mart?
We are currently working on a data mart design. We have many foreign keys to dimension tables. We are deciding whether to allow NULL in the foreign key dimension fields or to use -1 to represent NULL values.
Kimball suggests keeping a default row for NULL values: http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/fact-table-null/
My lead suggests keeping NULL as NULL.
Will there be any performance impact from keeping NULL in foreign key fields?
Kimball is right (as he usually is). Use a default value where you would use NULL.
Why? It ensures that joins to the dimensions will not "accidentally" filter rows. Trying to reconcile results from different queries eats up a lot of time. Ensuring that joins succeed is one method of reducing such discrepancies.
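To illustrate the "accidental" filtering, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`dim_customer`, `fact_sales_null`, `fact_sales_default`) are made up for the example and are not from the question; the same behaviour should apply to an inner join in any SQL database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
    -- Variant A: missing customers are stored as NULL.
    CREATE TABLE fact_sales_null (sale_id INTEGER PRIMARY KEY, customer_key INTEGER, amount REAL);
    -- Variant B: missing customers point at a default 'Unknown' row (key -1).
    CREATE TABLE fact_sales_default (sale_id INTEGER PRIMARY KEY, customer_key INTEGER NOT NULL, amount REAL);

    INSERT INTO dim_customer VALUES (-1, 'Unknown'), (1, 'Acme');
    INSERT INTO fact_sales_null VALUES (1, 1, 100.0), (2, NULL, 50.0);
    INSERT INTO fact_sales_default VALUES (1, 1, 100.0), (2, -1, 50.0);
""")

def joined_total(fact_table):
    """Sum sales after an inner join to the customer dimension."""
    # fact_table only ever comes from the two literal names used below.
    sql = (
        f"SELECT SUM(f.amount) FROM {fact_table} AS f "
        "JOIN dim_customer AS d ON d.customer_key = f.customer_key"
    )
    return conn.execute(sql).fetchone()[0]

total_with_null = joined_total("fact_sales_null")        # the NULL row is dropped by the join
total_with_default = joined_total("fact_sales_default")  # every row survives the join

print(total_with_null, total_with_default)  # 100.0 150.0
```

The two totals differ by exactly the sale with the missing customer, which is the kind of discrepancy that is tedious to track down when reconciling reports.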
If you are not going to follow his advice, then store using NULL. A value such as -1 with no matching row in the dimension is particularly bad, because it prevents the database from enforcing foreign key constraints.
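A small sketch of the constraint behaviour, again with `sqlite3` and invented table names (`dim_product`, `fact_orders`). Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`; other databases enforce them by default, so treat this as an illustration rather than a statement about any particular warehouse platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE fact_orders (
        order_id    INTEGER PRIMARY KEY,
        product_key INTEGER REFERENCES dim_product (product_key)
    );
    INSERT INTO dim_product VALUES (1, 'Widget');
""")

def try_insert(order_id, product_key):
    """Return True if the fact row is accepted, False if the FK check rejects it."""
    try:
        conn.execute("INSERT INTO fact_orders VALUES (?, ?)", (order_id, product_key))
        return True
    except sqlite3.IntegrityError:
        return False

null_accepted = try_insert(1, None)   # a NULL foreign key passes the constraint
orphan_accepted = try_insert(2, -1)   # -1 has no dimension row yet, so it is rejected

# Once the default row actually exists, -1 is an ordinary, enforceable key.
conn.execute("INSERT INTO dim_product VALUES (-1, 'Unknown')")
default_accepted = try_insert(3, -1)

print(null_accepted, orphan_accepted, default_accepted)  # True False True
```

In other words, an orphan -1 only fits if the constraint is dropped or disabled, whereas NULL or a real default row both keep the constraint intact.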
Another reason to avoid using NULL that Gordon hasn't covered: it's unclear what NULL means.
Sometimes you have a NULL in a data mart or data warehouse because something has gone wrong in the ETL or in a source system, leading to a NULL. Other times you have a NULL because that column doesn't apply to that particular row. Or in the case of something like an accumulating snapshot table, because that column has not been populated yet, as the process being reported on hasn't yet reached the point where that column will be populated.
Rather than a single default value I like to set up multiple; for instance, you can set up every dimension to have a row that indicates "Unknown" which you might use for missing values, and a row that indicates "N/A" for cases where the value does not apply. I tend to set these up with negative integers for keys (-1 is Unknown, -2 is N/A, etc.), as that allows me to use the same keys for these rows in every table. But as both Kimball and Gordon indicate, you should actually create those rows in your dimensions.
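One way to sketch this seeding step, assuming two illustrative dimensions (`dim_customer`, `dim_product`) that share a hypothetical `dim_key`/`label` layout. Real dimensions will have more columns, so the insert would need a sensible value for each of them.

```python
import sqlite3

# The same reserved keys in every dimension: -1 = Unknown, -2 = N/A.
SPECIAL_MEMBERS = [(-1, "Unknown"), (-2, "N/A")]
DIMENSIONS = ("dim_customer", "dim_product")  # illustrative table names

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (dim_key INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE dim_product  (dim_key INTEGER PRIMARY KEY, label TEXT NOT NULL);
""")

def seed_special_members(conn):
    """Insert the reserved rows into every dimension; safe to re-run."""
    for table in DIMENSIONS:
        conn.executemany(
            f"INSERT OR IGNORE INTO {table} (dim_key, label) VALUES (?, ?)",
            SPECIAL_MEMBERS,
        )
    conn.commit()

seed_special_members(conn)
seed_special_members(conn)  # second run adds nothing because the keys already exist

seeded = {
    table: conn.execute(
        f"SELECT dim_key, label FROM {table} ORDER BY dim_key DESC"
    ).fetchall()
    for table in DIMENSIONS
}
print(seeded["dim_customer"])  # [(-1, 'Unknown'), (-2, 'N/A')]
```

Because the reserved keys are negative, they cannot collide with surrogate keys generated from a positive sequence.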
This makes it really easy to run data quality checks looking for cases where something has gone wrong. It means you can display some meaningful values in any reporting or analysis tools so people can filter out rows that haven't fully populated if they want to, or so your data stewards can look for problematic data via those tools. Or perhaps people might want to specifically look for those rows where one of the dimensions isn't applicable.
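A data quality check along these lines might look like the following sketch. The `dim_store` and `fact_sales` tables and their contents are invented for the example, and it assumes the convention above that all reserved rows have negative keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT NOT NULL);
    CREATE TABLE fact_sales (
        sale_id   INTEGER PRIMARY KEY,
        store_key INTEGER NOT NULL REFERENCES dim_store (store_key)
    );
    INSERT INTO dim_store VALUES (-2, 'N/A'), (-1, 'Unknown'), (1, 'Downtown');
    INSERT INTO fact_sales VALUES (1, 1), (2, -1), (3, -1), (4, -2);
""")

# Count fact rows per reserved member; negative keys mark the special rows.
rows = conn.execute("""
    SELECT d.store_name, COUNT(*) AS fact_rows
    FROM fact_sales AS f
    JOIN dim_store AS d ON d.store_key = f.store_key
    WHERE f.store_key < 0
    GROUP BY d.store_key, d.store_name
    ORDER BY d.store_key DESC
""").fetchall()

special_counts = dict(rows)
print(special_counts)  # {'Unknown': 2, 'N/A': 1}
```

A rising "Unknown" count points at an ETL or source problem, while the "N/A" count is expected and can simply be filtered in reports.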
If you have a situation where data sometimes loads in the "wrong" order (i.e. a fact table gets populated, but the relevant dimension members haven't been added to the dimension yet), you can also use this to check for rows that need updating in your ETL and automate fixing the issue, without repeatedly trying to update those rows that do not need updating because they will always have a NULL.
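As a rough sketch of that automated fix, assume the fact table also keeps the source system's natural key (a hypothetical `customer_code` column) so that "Unknown" rows can be re-resolved later. The answer does not say how the fix is implemented, so the schema and the `resolve_late_arrivals` helper below are illustrative only.

```python
import sqlite3

UNKNOWN_KEY = -1  # reserved key for "not resolved yet"; -2 (N/A) is never retried

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_code TEXT UNIQUE);
    CREATE TABLE fact_sales (
        sale_id       INTEGER PRIMARY KEY,
        customer_code TEXT,              -- natural key from the source (assumed)
        customer_key  INTEGER NOT NULL
    );
    INSERT INTO dim_customer VALUES (-2, NULL), (-1, NULL), (1, 'C001');
    INSERT INTO fact_sales VALUES
        (1, 'C001', 1),    -- resolved at load time
        (2, 'C002', -1),   -- dimension member had not arrived yet
        (3, NULL,   -2);   -- customer does not apply to this row
""")

def resolve_late_arrivals(conn):
    """Re-point 'Unknown' facts whose dimension member now exists; return rows fixed."""
    cur = conn.execute("""
        UPDATE fact_sales
        SET customer_key = (SELECT d.customer_key FROM dim_customer AS d
                            WHERE d.customer_code = fact_sales.customer_code)
        WHERE customer_key = ?
          AND EXISTS (SELECT 1 FROM dim_customer AS d
                      WHERE d.customer_code = fact_sales.customer_code)
    """, (UNKNOWN_KEY,))
    conn.commit()
    return cur.rowcount

fixed_before = resolve_late_arrivals(conn)  # nothing to fix yet

conn.execute("INSERT INTO dim_customer VALUES (2, 'C002')")  # the late member arrives
fixed_after = resolve_late_arrivals(conn)

final_keys = conn.execute(
    "SELECT sale_id, customer_key FROM fact_sales ORDER BY sale_id"
).fetchall()
print(fixed_before, fixed_after, final_keys)  # 0 1 [(1, 1), (2, 2), (3, -2)]
```

Only rows keyed -1 are candidates for the update, so the N/A row is never touched, which is the saving the paragraph above describes.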
And down the line when someone else takes over support of this data mart, they'll be really thankful when they don't have to spend a huge amount of time unpicking whether those NULLs or -1s indicate a problem.