
How to find unused rows in a dimension table

I have a dimension table in my database that has grown too large. By that I mean it has too many records (over a million) because it grew at the same pace as the linked facts. This is mostly due to bad design, and I'm trying to clean it up.

One of the things I'm trying to do is remove dimension records that are no longer used. The fact tables are regularly maintained and old snapshots are removed. Because the dimensions were not maintained in the same way, many rows in the table have a primary key value that no longer appears in any of the linked fact tables. All the fact tables have foreign key constraints.

Is there a way to locate table rows whose primary key value no longer appears in any of the tables that are linked to it with a foreign key constraint?

I tried writing a script to track this. Basically this:

-- [key] is bracketed because KEY is a reserved word in T-SQL
select d.[key]
from dimension d
where not exists (select 1 from fact1 f where f.fk = d.[key])
  and not exists (select 1 from fact2 f where f.fk = d.[key])
  and not exists (select 1 from fact3 f where f.fk = d.[key])

But with a lot of linked tables this query dies after some time; at least, my Management Studio crashed. So I'm not sure whether there are any other options.

We had to do something similar for one of my clients. The query, like yours with "not exists ... and not exists ... and not exists ...", was taking ~22 hours to run before we changed our strategy and got it down to ~20 minutes.

As Nsousa suggests, you have to split the query so SQL Server doesn't have to handle all the data in one shot, which forces it to use tempdb and other resources unnecessarily.

First, create a new table with all the keys in it. The reasons for creating this table are to avoid a full scan of the dimension table for every query, to fit more keys on an 8k page, and to deal with a smaller and smaller set of keys after each delete.

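-- Working table that starts with every dimension key; keys still in use are removed later
create table DimensionkeysToDelete (Dimkey char(32) primary key nonclustered);
insert into DimensionkeysToDelete (Dimkey)
select [key] from dimension order by [key];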

Then, instead of deleting the unused keys, delete the keys that exist in the fact tables, beginning with the fact table that has the fewest rows. Make sure the fact tables have proper indexes for performance.

-- Remove every key that fact1 still references (the fact table needs the alias f)
delete d
from DimensionkeysToDelete d
inner join fact1 f on f.fk = d.Dimkey;

-- Same for fact2, now working on a smaller key set
delete d
from DimensionkeysToDelete d
inner join fact2 f on f.fk = d.Dimkey;

-- Same for fact3
delete d
from DimensionkeysToDelete d
inner join fact3 f on f.fk = d.Dimkey;
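
Since the question says all the fact tables have foreign key constraints, the list of tables to process, and a sensible order to process them in, could probably be read from the catalog views. This is only a sketch: 'dbo.dimension' is a placeholder for the real dimension table name, and the row counts from sys.partitions are approximate.

```sql
-- List every table that references the dimension through a foreign key,
-- smallest first, so the deletes above can be run in that order.
-- 'dbo.dimension' is a placeholder (assumption), not a name from the post.
select
    fk.name                                               as fk_name,
    object_schema_name(fk.parent_object_id)               as fact_schema,
    object_name(fk.parent_object_id)                      as fact_table,
    col_name(fkc.parent_object_id, fkc.parent_column_id)  as fk_column,
    sum(p.rows)                                           as approx_rows
from sys.foreign_keys fk
inner join sys.foreign_key_columns fkc
    on fkc.constraint_object_id = fk.object_id
inner join sys.partitions p
    on p.object_id = fk.parent_object_id
   and p.index_id in (0, 1)   -- heap or clustered index only, so rows are not counted once per index
where fk.referenced_object_id = object_id('dbo.dimension')
group by fk.object_id, fk.name, fk.parent_object_id,
         fkc.parent_object_id, fkc.parent_column_id
order by approx_rows;
```

Each row of the result corresponds to one delete statement of the kind shown above, with fact_table and fk_column filled in.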

Once all the fact tables are done, only unused keys remain in DimensionkeysToDelete. To answer your question, just select from this table to get all the unused keys for that particular dimension, or join it with the dimension to get the data.

But from what I understand of your need to clean up your warehouse, you can use this table to delete from the original dimension table. At this step, you might also want to take some action for auditing purposes (e.g. insert into an audit table 'Key ' + key + ' deleted on ' + convert(varchar(23), getdate(), 121) + ' by script X' ...).
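
One way the final delete and the audit step might be combined is with an OUTPUT clause, so each removed key is logged by the same statement that deletes it. This is a sketch under assumptions: DimensionDeleteAudit and its columns are hypothetical names, and the dimension key column is assumed to be called key as in the question.

```sql
-- Hypothetical audit table; the name and columns are illustrative only.
create table DimensionDeleteAudit (
    Dimkey     char(32)     not null,
    DeletedOn  datetime     not null,
    DeletedBy  varchar(50)  not null
);

-- Delete the unused dimension rows and record each removed key.
delete d
output deleted.[key], getdate(), 'script X'
into DimensionDeleteAudit (Dimkey, DeletedOn, DeletedBy)
from dimension d
inner join DimensionkeysToDelete k on k.Dimkey = d.[key];
```

Because the fact tables have foreign key constraints, the delete should fail rather than orphan facts if a key has been referenced again since the working table was built. On a very large dimension it may be worth deleting in batches to keep the transaction log manageable.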

I think this can be optimized (take a look at the execution plan), but my client was happy with it so we didn't have to put much more effort into it.

You may want to split that into separate queries. Check for unused rows against fact1, then against fact2, and so on, individually. Then intersect all those results to get the rows that are unused in all fact tables.
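
A minimal sketch of that idea, reusing the table and column names from the question (dimension, key, fact1..fact3, fk). The temp table names are made up for illustration.

```sql
-- Step 1: one small query per fact table. Each temp table holds the
-- dimension keys that this particular fact table does not reference.
select [key] as Dimkey into #unused_fact1 from dimension
except
select fk from fact1;

select [key] as Dimkey into #unused_fact2 from dimension
except
select fk from fact2;

select [key] as Dimkey into #unused_fact3 from dimension
except
select fk from fact3;

-- Step 2: keys unused by every fact table are the intersection of the results.
select Dimkey from #unused_fact1
intersect
select Dimkey from #unused_fact2
intersect
select Dimkey from #unused_fact3;
```

Each step touches only one fact table, so a slow or failing step can be identified and rerun on its own.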

I would also suggest a left outer join instead of nested queries: count the rows in the fact table for each pk, and filter out of the result set those that have a non-zero count.
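
For a single fact table, the left outer join variant might look like the following sketch, again using the names from the question:

```sql
-- count(f.fk) counts only matched fact rows, so it is 0 for dimension
-- keys that fact1 never references.
select d.[key]
from dimension d
left outer join fact1 f on f.fk = d.[key]
group by d.[key]
having count(f.fk) = 0;
```

Whether this beats not exists depends on the execution plan and indexing, so it is worth comparing both on real data.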

Your query will struggle because it scans every fact table at the same time.
