How do I clean up data that would violate a Primary Key in SQL Server 2008?

I have some crappy data from a source I cannot control. It needs to go into a table with a composite primary key that looks like this:

PK_Part1, PK_Part2, StringData, DateData

My crappy data has full duplicates, PK duplicates with different StringData, PK duplicates with different DateData, and PK duplicates with different StringData and DateData.

So I might see:

1234,1234,Blah,2011-1-1
1234,1234,Blah,2011-1-1
4321,4321,Blah,2011-1-1
4321,4321,Blah,2011-10-10
5678,5678,Blah,2011-1-1
5678,5678,Blah1,2011-1-1
8765,8765,Blah,2011-1-1
8765,8765,Blah,2011-10-10
8765,8765,Blah1,2011-10-10

How do I clean this up in SQL Server 2008, given that:
A) I want only the data associated with the latest date.
B) I'm trying to force the issue with the source about the string data, but for now a longer string is better; if the lengths are the same, either will do.
C) I have to assume the source will be of no help and load everything now.

I had hoped to use MERGE, but it seems to compare all rows of the source table and target table before running any of the WHEN MATCHED or WHEN NOT MATCHED actions, so I got PK violations. Removing the PK constraint let all the duplicates in.

If you don't have that data in SQL Server already, BULK INSERT it into a temporary table:

CREATE TABLE #tempStaging
(PK_Part1 INT, PK_Part2 INT, StringData VARCHAR(500), DateData DATE)

BULK INSERT #tempStaging
FROM 'c:\yourfile.txt'
WITH (FIELDTERMINATOR =',',
     ROWTERMINATOR ='\n')

Then you should be able to do something like:

;WITH CleanupData AS
(
  SELECT 
      PK_Part1, PK_Part2, StringData, DateData,
      -- number the rows within each key: newest date first, longest string first on ties
      ROW_NUMBER() OVER(PARTITION BY PK_Part1, PK_Part2
                        ORDER BY DateData DESC, LEN(StringData) DESC) AS RowNum
  FROM
      #tempStaging
)
INSERT INTO dbo.YourTargetTable(PK_Part1, PK_Part2, StringData, DateData)
    SELECT PK_Part1, PK_Part2, StringData, DateData 
    FROM CleanupData
    WHERE RowNum = 1

This will "partition" your data by the composite key (PK_Part1, PK_Part2), and each partition is ordered by date (descending, newest first), with the longer string first when the dates tie.

So the entry with RowNum = 1 is the newest entry for each partition. Pick that one and toss out all the others, and your data is cleaned up!
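To see which rows survive that rule, here is a minimal Python sketch that mirrors the ROW_NUMBER logic on the sample data from the question. It is only an illustration, not T-SQL: the function name `pick_best` is made up for this example, and Python's `len` counts trailing spaces whereas SQL Server's LEN does not.

```python
from datetime import date

# Sample rows from the question: (PK_Part1, PK_Part2, StringData, DateData)
rows = [
    (1234, 1234, "Blah", date(2011, 1, 1)),
    (1234, 1234, "Blah", date(2011, 1, 1)),
    (4321, 4321, "Blah", date(2011, 1, 1)),
    (4321, 4321, "Blah", date(2011, 10, 10)),
    (5678, 5678, "Blah", date(2011, 1, 1)),
    (5678, 5678, "Blah1", date(2011, 1, 1)),
    (8765, 8765, "Blah", date(2011, 1, 1)),
    (8765, 8765, "Blah", date(2011, 10, 10)),
    (8765, 8765, "Blah1", date(2011, 10, 10)),
]


def pick_best(rows):
    """Keep one row per (PK_Part1, PK_Part2): newest date, then longest string."""
    best = {}
    for pk1, pk2, text, day in rows:
        key = (pk1, pk2)
        # Tuple comparison plays the role of
        # ORDER BY DateData DESC, LEN(StringData) DESC
        rank = (day, len(text))
        if key not in best or rank > best[key][0]:
            best[key] = (rank, (pk1, pk2, text, day))
    return [row for _, row in best.values()]


cleaned = pick_best(rows)
for row in cleaned:
    print(row)
```

For the sample data this keeps one row per key: `Blah` on 2011-1-1 for 1234, `Blah` on 2011-10-10 for 4321, `Blah1` on 2011-1-1 for 5678, and `Blah1` on 2011-10-10 for 8765.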

HINT: this assumes that your target table is empty! If that's not the case, then yes, you might need to apply a MERGE statement instead, based on the CTE that selects the data to keep from the BULK INSERT.

The data from the source should go into a temp table, a temporary holding area. Then you can choose the best row for each key from that (since your sample data contains duplicate part1+part2 even within the input data).

Sample table and temp table

create table pkdup(
    PK_Part1 int, PK_Part2 int, StringData varchar(100), DateData datetime,
    primary key (PK_Part1,PK_Part2))
insert pkdup select 1234,1234,'', GETDATE()+1000

create table #tmp(col1 nvarchar(max), col2 nvarchar(max), col3 nvarchar(max), col4 datetime)
insert #tmp values
(1234,1234,'Blah','2011-1-1'),
(1234,1234,'Blah','2011-1-1'),
(4321,4321,'Blah','2011-1-1'),
(4321,4321,'Blah','2011-10-10'),
(5678,5678,'Blah','2011-1-1'),
(5678,5678,'Blah1','2011-1-1'),
(8765,8765,'Blah','2011-1-1'),
(8765,8765,'Blah','2011-10-10'),
(8765,8765,'Blah1','2011-10-10');

The merge statement

merge pkdup as target
using (
    select col1, col2, col3, col4
    from (select *, row_number() over (
        partition by col1, col2
        order by col4 desc, len(col3) desc) rownum
        from #tmp) t
    where rownum=1 -- only the best
    ) as source
on source.col1=target.PK_Part1 and source.col2=target.PK_Part2
WHEN MATCHED AND (source.col4 > target.datedata or (source.col4=target.datedata and len(source.col3) > len(target.stringdata)))
    THEN UPDATE SET target.stringdata = source.col3, target.datedata = source.col4
WHEN NOT MATCHED THEN
    INSERT (PK_Part1, PK_Part2, StringData, DateData)
    VALUES (source.col1, source.col2, source.col3, source.col4);
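The matched / not matched branches can be easier to reason about outside SQL. Below is a rough Python sketch of the same decision rule, assuming the source has already been reduced to one row per key. The function name `merge_rows`, the dict layout, and the 2099 date (standing in for the far-future `GETDATE()+1000` seed row) are all assumptions made for this illustration.

```python
from datetime import date


def merge_rows(target, source):
    """Apply deduplicated source rows to target.

    Both are dicts keyed by (PK_Part1, PK_Part2) with
    (StringData, DateData) tuples as values.
    """
    for key, (text, day) in source.items():
        if key not in target:
            # WHEN NOT MATCHED THEN INSERT
            target[key] = (text, day)
            continue
        old_text, old_day = target[key]
        newer = day > old_day
        same_day_longer = day == old_day and len(text) > len(old_text)
        if newer or same_day_longer:
            # WHEN MATCHED AND (...) THEN UPDATE
            target[key] = (text, day)
    return target


# Existing target row with a far-future date, like the seed row in pkdup
target = {(1234, 1234): ("", date(2099, 1, 1))}

# One row per key, i.e. the rownum = 1 rows from the temp table
source = {
    (1234, 1234): ("Blah", date(2011, 1, 1)),
    (4321, 4321): ("Blah", date(2011, 10, 10)),
    (5678, 5678): ("Blah1", date(2011, 1, 1)),
    (8765, 8765): ("Blah1", date(2011, 10, 10)),
}

merge_rows(target, source)
for key, value in target.items():
    print(key, value)
```

The existing 1234 row is left alone because its date is later than the incoming one, and the other three keys are inserted.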

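We usually put this kind of data into a staging table, then clean out the duplicates in the staging table before trying to run the merge statement.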

Not sure if you can apply a string length function in a join, but if you can, try this:

-- YourTable is a placeholder for the table holding the raw rows.
-- SQL Server has no FIRST() aggregate, so MAX() picks one string among equal-length ties.
select ml.PK_Part1, ml.PK_Part2, ml.max_date, ml.max_len, max(t.StringData) as first_string
from        
   (select s.PK_Part1, s.PK_Part2, md.max_date, max(len(s.StringData)) as max_len
    from YourTable s inner join
           (select PK_Part1, PK_Part2, max(DateData) as max_date
           from YourTable
           group by
           PK_Part1, PK_Part2) md
    on s.PK_Part1 = md.PK_Part1 and 
           s.PK_Part2 = md.PK_Part2 and 
           s.DateData = md.max_date
    group by
           s.PK_Part1, s.PK_Part2, md.max_date) ml
   inner join YourTable t on 
           t.PK_Part1 = ml.PK_Part1 and 
           t.PK_Part2 = ml.PK_Part2 and 
           t.DateData = ml.max_date and
           len(t.StringData) = ml.max_len
   group by
           ml.PK_Part1, ml.PK_Part2, ml.max_date, ml.max_len
