
Loading unfinished data to data warehouse

Original post: 2012-09-03 21:16:32 | Tags: sql-server, data-warehouse, business-intelligence

The title might be confusing, so I'd like to describe my current problem.

Please imagine the following situation: a system stores devices' issues, which should be fixed by qualified workers. I have a table "issue" with:

id as PK
workerid as FK
status, which describes whether the problem is solved or unsolved
estimated completion time
real completion time

and other columns. I also have a data warehouse which will store the "issues" and describe the performance of those "workers" (mostly working time).

During the ETL process, the biggest problem comes with "unsolved issues". I see two possibilities:

a) Process only solved issues and leave unsolved ones out until they are finished, then process them. The drawback is that my reports would not include issues that are taking too long to finish, which might be crucial from a business perspective.

b) Process both solved and unsolved issues; the PK in the fact table could be issueId and status. But then I would store almost identical rows for the same issue, which seems odd and would be difficult to analyze.
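
To make option (b) concrete, here is a rough sketch using Python's built-in sqlite3 module rather than SQL Server. The table name fact_issue, the column names, and the sample rows are all made up for illustration. It shows how the composite key leads to two nearly identical rows for one issue, so a plain row count no longer equals the number of issues.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical fact table for option (b): one row per (issue, status) pair.
conn.execute("""
    CREATE TABLE fact_issue (
        issue_id             INTEGER NOT NULL,
        status               TEXT    NOT NULL,  -- 'unsolved' or 'solved'
        worker_id            INTEGER NOT NULL,
        estimated_completion TEXT,
        real_completion      TEXT,              -- NULL while the issue is unsolved
        PRIMARY KEY (issue_id, status)
    )
""")

# Made-up sample data: issue 1 was loaded twice, once per status.
rows = [
    (1, "unsolved", 10, "2012-09-02", None),
    (1, "solved",   10, "2012-09-02", "2012-09-03"),
    (2, "unsolved", 11, "2012-09-05", None),
]
conn.executemany("INSERT INTO fact_issue VALUES (?, ?, ?, ?, ?)", rows)

row_count = conn.execute("SELECT COUNT(*) FROM fact_issue").fetchone()[0]
issue_count = conn.execute(
    "SELECT COUNT(DISTINCT issue_id) FROM fact_issue"
).fetchone()[0]
print(row_count, issue_count)
```

Every report that counts or aggregates issues would have to deduplicate on issue_id in this way, which is the analysis difficulty mentioned above.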

Is this a common situation? Which of these two possibilities seems more reasonable? Or is there perhaps another, better way to do this?

1 answer

It seems like there should be an issues dimension, and that dimension would hold the status column. There are a couple of issues with changing facts:

You are going to have to set up a scheduled process that updates the status column of the fact table every x minutes. I always try to avoid updating a fact table: it makes cube processing more difficult, it can introduce blocking, and change tracking is difficult (when did the status change, who changed it, and why?). Additionally, if/when you upgrade to SQL Server 2012 and want to use columnstore indexes (which have revolutionized star schema query performance), you won't be able to update the column directly, because in that version a table with a columnstore index is read-only.
Dimensions are sometimes expected to change. Facts are not. If the status is in the dimension, it's also easy to set up change tracking. Look into slowly changing dimensions.
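
As a minimal sketch of what a Type 2 slowly changing dimension for issues might look like, the following uses Python's built-in sqlite3 module instead of SQL Server so that it runs anywhere. The table dim_issue, its columns (issue_key, valid_from, valid_to, is_current), and the helper apply_status are assumptions for illustration, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical issue dimension: one row per status version of an issue.
conn.execute("""
    CREATE TABLE dim_issue (
        issue_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        issue_id   INTEGER NOT NULL,                   -- business key from the source
        status     TEXT    NOT NULL,
        valid_from TEXT    NOT NULL,
        valid_to   TEXT,                               -- NULL means still open
        is_current INTEGER NOT NULL DEFAULT 1
    )
""")

def apply_status(conn, issue_id, status, as_of):
    """Type 2 change: if the status differs, close the current row and add a new one."""
    current = conn.execute(
        "SELECT issue_key, status FROM dim_issue "
        "WHERE issue_id = ? AND is_current = 1",
        (issue_id,),
    ).fetchone()
    if current is not None and current[1] == status:
        return  # status unchanged, keep the current row
    if current is not None:
        conn.execute(
            "UPDATE dim_issue SET valid_to = ?, is_current = 0 WHERE issue_key = ?",
            (as_of, current[0]),
        )
    conn.execute(
        "INSERT INTO dim_issue (issue_id, status, valid_from) VALUES (?, ?, ?)",
        (issue_id, status, as_of),
    )

# Made-up ETL runs on three consecutive days.
apply_status(conn, 1, "unsolved", "2012-09-01")
apply_status(conn, 1, "unsolved", "2012-09-02")  # no change, no new row
apply_status(conn, 1, "solved", "2012-09-03")

history = conn.execute(
    "SELECT status, valid_from, valid_to, is_current "
    "FROM dim_issue WHERE issue_id = 1 ORDER BY issue_key"
).fetchall()
print(history)
```

The dimension keeps the full status history (when it changed), while fact rows are only inserted and never updated. Filtering on is_current = 1 gives the latest status per issue.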