Deduplication of imported records in SQL Server
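I have the following T-SQL stored procedure, which currently accounts for 50% of the total time needed to run all processing on records newly imported into our backend analysis suite. Unfortunately, this data has to be imported every time, and it is becoming a bottleneck as our database grows.
Basically, we are trying to identify all duplicates among the records and keep only one of them.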
DECLARE @status INT
SET @status = 3
DECLARE @contactid INT
DECLARE @email VARCHAR (100)
--Contacts
DECLARE email_cursor CURSOR FOR
SELECT email FROM contacts WHERE (reference = @reference AND status = 1 ) GROUP BY email HAVING (COUNT(email) > 1)
OPEN email_cursor
FETCH NEXT FROM email_cursor INTO @email
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @email
    UPDATE contacts SET duplicate = 1, status = @status WHERE email = @email AND reference = @reference AND status = 1
    SELECT TOP 1 @contactid = id FROM contacts WHERE reference = @reference AND email = @email AND duplicate = 1
    UPDATE contacts SET duplicate = 0, status = 1 WHERE id = @contactid
    FETCH NEXT FROM email_cursor INTO @email
END
CLOSE email_cursor
DEALLOCATE email_cursor
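I have added all the indexes I can see from the query execution plans, but it may be possible to rewrite the whole SP to work in a different way, as I have managed to do with others.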
Use this single query to de-dup.
;with tmp as (
select *
,rn=row_number() over (partition by email, reference order by id)
,c=count(1) over (partition by email, reference)
from contacts
where status = 1
)
update tmp
set duplicate = case when rn=1 then 0 else 1 end
,status = case when rn=1 then 1 else 3 end
where c > 1
;
It only de-dups among records where status = 1, and treats rows with the same (email, reference) combination as duplicates.
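To check that behaviour on a small data set without touching the real database, here is a rough sketch using Python's built-in sqlite3 module instead of SQL Server. The table layout, the sample rows, and the function names make_demo_db and dedupe are made up for illustration. SQLite cannot update through a CTE the way T-SQL can, so the sketch ranks rows in a CTE and then updates by id. It assumes a SQLite build with window functions (3.25 or later) and that duplicate starts at 0, so the kept row in each group is simply left alone.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory contacts table with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contacts ("
        " id INTEGER PRIMARY KEY,"
        " email TEXT,"
        " reference INTEGER,"
        " status INTEGER,"
        " duplicate INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO contacts (id, email, reference, status) VALUES (?, ?, ?, ?)",
        [
            (1, "a@example.com", 10, 1),  # first of its group: kept
            (2, "a@example.com", 10, 1),  # duplicate of id 1
            (3, "a@example.com", 20, 1),  # same email, other reference: not a duplicate
            (4, "b@example.com", 10, 1),  # unique
            (5, "a@example.com", 10, 2),  # status <> 1: ignored
            (6, "a@example.com", 10, 1),  # duplicate of id 1
        ],
    )
    return conn


def dedupe(conn):
    """Flag every row after the first in each (email, reference) group."""
    conn.execute(
        """
        WITH ranked AS (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY email, reference ORDER BY id
                   ) AS rn
            FROM contacts
            WHERE status = 1
        )
        UPDATE contacts
        SET duplicate = 1, status = 3
        WHERE id IN (SELECT id FROM ranked WHERE rn > 1)
        """
    )
    conn.commit()


if __name__ == "__main__":
    conn = make_demo_db()
    dedupe(conn)
    for row in conn.execute(
        "SELECT id, email, reference, status, duplicate FROM contacts ORDER BY id"
    ):
        print(row)
```

With these sample rows, only ids 2 and 6 end up with duplicate = 1 and status = 3. Running dedupe a second time changes nothing, because the flagged rows no longer have status = 1.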