
How do I remove duplicates from a database?

I have a table with four fields: an auto-increment ID, a string, and two integers. I want to do something like:

     select count(*) from table group by string

and then use the result to consolidate every group whose count is larger than 1.

That is, take every group of rows that share the same string and has a count larger than 1, and replace those rows in the database with a single row. The ID does not matter, and each of the two integers should be the sum of that column over all the rows in the group.

Is that possible using a few simple queries?

Thanks.

I would suggest inserting into a temporary table the data grouped by string, together with min(id), for the strings that have duplicates. Then update the original table with the sums where id = min(id), and delete the rows where the strings match but the ids don't.

 -- temp must already exist with columns (string, id, int1, int2)
 insert into temp
 select string, min(id) id, sum(int1) int1, sum(int2) int2
   from table
  group by string
 having count(*) > 1;

 -- multiple-table UPDATE (MySQL syntax)
 update table, temp
    set table.int1 = temp.int1,
        table.int2 = temp.int2
  where table.id = temp.id;

 -- Works because temp holds only one record per string
 delete from table
  where exists (select null from temp where temp.string = table.string and temp.id <> table.id);

A backup is mandatory :-) and so is a transaction.
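The temp-table steps above can be tried out end to end with a small sketch. It assumes SQLite through Python's sqlite3 module rather than the MySQL-style syntax of the answer, so the UPDATE uses correlated subqueries instead of a join. The names items, temp_dups, make_demo_db and consolidate_with_temp_table are made up for the illustration.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table shaped like the one in the question (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        "id INTEGER PRIMARY KEY, string TEXT, int1 INTEGER, int2 INTEGER)"
    )
    conn.executemany(
        "INSERT INTO items (string, int1, int2) VALUES (?, ?, ?)",
        [("a", 1, 10), ("a", 2, 20), ("b", 3, 30), ("a", 4, 40)],
    )
    conn.commit()
    return conn


def consolidate_with_temp_table(conn):
    """Fold rows that share a string into the row with the lowest id."""
    # Step 1: one summary row per duplicated string, keyed by the lowest id.
    conn.execute("DROP TABLE IF EXISTS temp_dups")
    conn.execute(
        "CREATE TEMP TABLE temp_dups AS "
        "SELECT string, MIN(id) AS id, SUM(int1) AS int1, SUM(int2) AS int2 "
        "FROM items GROUP BY string HAVING COUNT(*) > 1"
    )
    with conn:  # commits on success, rolls back if either statement fails
        # Step 2: copy the sums into the row that will be kept.
        conn.execute(
            "UPDATE items SET "
            "int1 = (SELECT temp_dups.int1 FROM temp_dups WHERE temp_dups.id = items.id), "
            "int2 = (SELECT temp_dups.int2 FROM temp_dups WHERE temp_dups.id = items.id) "
            "WHERE id IN (SELECT temp_dups.id FROM temp_dups)"
        )
        # Step 3: delete the other rows that carry the same string.
        conn.execute(
            "DELETE FROM items WHERE EXISTS ("
            "SELECT 1 FROM temp_dups "
            "WHERE temp_dups.string = items.string AND temp_dups.id <> items.id)"
        )


conn = make_demo_db()
consolidate_with_temp_table(conn)
print(conn.execute("SELECT id, string, int1, int2 FROM items ORDER BY id").fetchall())
# prints [(1, 'a', 7, 70), (3, 'b', 3, 30)]
```

The update and the delete share one transaction, so a failure in the delete does not leave summed values sitting next to the rows they were summed from.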

There's a simple way to do this. Just put something like

id NOT IN (select min(id) from table group by string)

in your WHERE clause, which will select only the duplicates, that is, every row except the lowest-id row of each string.
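A minimal sketch of such a filter, assuming SQLite through Python's sqlite3 module and a hypothetical table named items. It uses MIN(id) inside the subquery so that exactly one well-defined row per string is treated as the original and the rest are reported as duplicates.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table shaped like the one in the question (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        "id INTEGER PRIMARY KEY, string TEXT, int1 INTEGER, int2 INTEGER)"
    )
    conn.executemany(
        "INSERT INTO items (string, int1, int2) VALUES (?, ?, ?)",
        [("a", 1, 10), ("a", 2, 20), ("b", 3, 30), ("a", 4, 40)],
    )
    conn.commit()
    return conn


def find_redundant_rows(conn):
    """Return (id, string) for every row that is not the lowest-id row of its string."""
    return conn.execute(
        "SELECT id, string FROM items "
        "WHERE id NOT IN (SELECT MIN(id) FROM items GROUP BY string) "
        "ORDER BY id"
    ).fetchall()


print(find_redundant_rows(make_demo_db()))
# prints [(2, 'a'), (4, 'a')]
```

Note that this only identifies the redundant rows; the sums still have to be written into the surviving row before anything is deleted.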

Start by selecting just the groups with count > 1, along with the sums that you want:

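select * from (
    -- alias the aggregates so the outer query can refer to them by name
    select count(*) as cnt, string_col, sum(int_col_1) as int_sum_1, sum(int_col_2) as int_sum_2
    from my_table
    group by string_col
) as foo where cnt > 1;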

After that, I would put that data into a temporary table, delete the rows you don't want from the original table, and insert the data from the temp table back into the original one.
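One way to read those three steps, sketched under the assumption of SQLite through Python's sqlite3 module. The names items, temp_sums and consolidate_by_reinsert are hypothetical, and the merged rows receive fresh ids, which the question says do not matter.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table shaped like the one in the question (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        "id INTEGER PRIMARY KEY, string TEXT, int1 INTEGER, int2 INTEGER)"
    )
    conn.executemany(
        "INSERT INTO items (string, int1, int2) VALUES (?, ?, ?)",
        [("a", 1, 10), ("a", 2, 20), ("b", 3, 30), ("a", 4, 40)],
    )
    conn.commit()
    return conn


def consolidate_by_reinsert(conn):
    """Replace each group of duplicated strings with a single summed row."""
    # Step 1: stash the per-string sums for the duplicated strings only.
    conn.execute("DROP TABLE IF EXISTS temp_sums")
    conn.execute(
        "CREATE TEMP TABLE temp_sums AS "
        "SELECT string, SUM(int1) AS int1, SUM(int2) AS int2 "
        "FROM items GROUP BY string HAVING COUNT(*) > 1"
    )
    with conn:  # delete and re-insert succeed or fail together
        # Step 2: remove every row that belongs to a duplicated string.
        conn.execute(
            "DELETE FROM items WHERE string IN (SELECT string FROM temp_sums)"
        )
        # Step 3: put one merged row per string back; ids are assigned anew.
        conn.execute(
            "INSERT INTO items (string, int1, int2) "
            "SELECT string, int1, int2 FROM temp_sums"
        )


conn = make_demo_db()
consolidate_by_reinsert(conn)
print(conn.execute("SELECT string, int1, int2 FROM items ORDER BY string").fetchall())
# prints [('a', 7, 70), ('b', 3, 30)]
```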

You can do it all in two queries, with no temp tables. But you need to run the DELETE query repeatedly, since it only deletes one duplicate per string at a time. So if there are 3 copies of a row, you would need to run it twice. You can simply run it until it affects no more rows.

First, update the duplicate row you are going to keep so that it contains the sums.

UPDATE tablename JOIN (
   -- count(*) must be selected as c for HAVING c>1 to work
   SELECT min(id) id, sum(int1) int1, sum(int2) int2, count(*) c
   FROM tablename GROUP BY string HAVING c>1
) AS dups ON tablename.id=dups.id
SET tablename.int1=dups.int1, tablename.int2=dups.int2;

Then you can use a similar SELECT query (this time picking max(id)) in a DELETE query, using the multiple-table syntax.

DELETE tablename FROM tablename 
JOIN (SELECT max(id) AS id,count(*) c FROM tablename GROUP BY string HAVING c>1) dups
ON tablename.id=dups.id

Just run that DELETE until it reports 0 affected rows.
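The update-then-repeat-delete loop can be sketched as follows, assuming SQLite through Python's sqlite3 module. SQLite has no multiple-table UPDATE or DELETE, so this sketch swaps the joins for subqueries; the names items and consolidate_in_place are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table shaped like the one in the question (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        "id INTEGER PRIMARY KEY, string TEXT, int1 INTEGER, int2 INTEGER)"
    )
    conn.executemany(
        "INSERT INTO items (string, int1, int2) VALUES (?, ?, ?)",
        [("a", 1, 10), ("a", 2, 20), ("b", 3, 30), ("a", 4, 40)],
    )
    conn.commit()
    return conn


def consolidate_in_place(conn):
    """Sum duplicates into the lowest-id row, then peel off the rest one pass at a time.

    Returns the number of DELETE passes that actually removed rows.
    """
    passes = 0
    with conn:  # one transaction around the update and every delete pass
        # Write the group sums into the lowest-id row of each duplicated string.
        conn.execute(
            "UPDATE items SET "
            "int1 = (SELECT SUM(d.int1) FROM items AS d WHERE d.string = items.string), "
            "int2 = (SELECT SUM(d.int2) FROM items AS d WHERE d.string = items.string) "
            "WHERE id IN ("
            "SELECT MIN(id) FROM items GROUP BY string HAVING COUNT(*) > 1)"
        )
        # Each pass removes the highest-id row of every string that still has duplicates.
        while True:
            cur = conn.execute(
                "DELETE FROM items WHERE id IN ("
                "SELECT MAX(id) FROM items GROUP BY string HAVING COUNT(*) > 1)"
            )
            if cur.rowcount == 0:
                break
            passes += 1
    return passes


conn = make_demo_db()
print(consolidate_in_place(conn))
# prints 2
print(conn.execute("SELECT id, string, int1, int2 FROM items ORDER BY id").fetchall())
# prints [(1, 'a', 7, 70), (3, 'b', 3, 30)]
```

With three copies of "a" the loop needs two productive passes, matching the count given above. The update must run exactly once: running it again before the deletes finish would add the already-summed row into the total a second time.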

If you can stop the table from being updated by other users, then it's pretty easy.

-- We're going to add records before deleting old ones, so keep track of which records are old.
DECLARE @OldMaxID INT
SELECT @OldMaxID = MAX(ID) FROM table

-- Combine duplicate records into new records
INSERT INTO table (string, int1, int2)
SELECT string, SUM(int1), SUM(int2)
FROM table
GROUP BY string
HAVING COUNT(*) > 1

-- Delete records that were used to make combined records.
-- DELETE cannot take GROUP BY/HAVING directly, so the grouping moves into a subquery.
DELETE FROM table
WHERE ID <= @OldMaxID
  AND string IN (SELECT string FROM table GROUP BY string HAVING COUNT(*) > 1)
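A rough equivalent of this append-then-delete approach, assuming SQLite through Python's sqlite3 module instead of T-SQL. A Python variable stands in for @OldMaxID, and the names items and consolidate_by_append are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table shaped like the one in the question (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        "id INTEGER PRIMARY KEY, string TEXT, int1 INTEGER, int2 INTEGER)"
    )
    conn.executemany(
        "INSERT INTO items (string, int1, int2) VALUES (?, ?, ?)",
        [("a", 1, 10), ("a", 2, 20), ("b", 3, 30), ("a", 4, 40)],
    )
    conn.commit()
    return conn


def consolidate_by_append(conn):
    """Append one merged row per duplicated string, then delete the rows it replaces."""
    with conn:  # insert and delete commit together or not at all
        # Remember the highest id that exists before any merged rows are added.
        old_max_id = conn.execute("SELECT MAX(id) FROM items").fetchone()[0]
        # Append the merged rows; they receive ids above old_max_id.
        conn.execute(
            "INSERT INTO items (string, int1, int2) "
            "SELECT string, SUM(int1), SUM(int2) "
            "FROM items GROUP BY string HAVING COUNT(*) > 1"
        )
        # Delete the old rows of every string that now appears more than once.
        conn.execute(
            "DELETE FROM items WHERE id <= ? AND string IN ("
            "SELECT string FROM items GROUP BY string HAVING COUNT(*) > 1)",
            (old_max_id,),
        )


conn = make_demo_db()
consolidate_by_append(conn)
print(conn.execute("SELECT string, int1, int2 FROM items ORDER BY string").fetchall())
# prints [('a', 7, 70), ('b', 3, 30)]
```

Strings that were unique to begin with still have a count of 1 after the insert, so the delete leaves them alone.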

You can derive this information in a VIEW:

 CREATE VIEW SummarizedData (StringCol, IntCol1, IntCol2, OriginalRowCount) AS
    SELECT StringCol, SUM(IntCol1), SUM(IntCol2), COUNT(*)
    FROM TableName
    GROUP BY StringCol

This will create a virtual table with the information you want. It will also include the rows for which there was only one instance of a StringCol value. If you really don't want those, add HAVING COUNT(*) > 1 to the end of the query.

With this method you can keep the original table and just read from the summarized data, or you can create an empty table with the appropriate columns and INSERT from SummarizedData into it to get a "real" table with the data.
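Both uses of the view can be sketched briefly, assuming SQLite through Python's sqlite3 module. The snake_case names (items, summarized_data, items_summary, materialize_summary) are stand-ins chosen for this illustration, not names from the answer.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table shaped like the one in the question (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        "id INTEGER PRIMARY KEY, string TEXT, int1 INTEGER, int2 INTEGER)"
    )
    conn.executemany(
        "INSERT INTO items (string, int1, int2) VALUES (?, ?, ?)",
        [("a", 1, 10), ("a", 2, 20), ("b", 3, 30), ("a", 4, 40)],
    )
    conn.commit()
    return conn


def create_summary_view(conn):
    """Define a view holding one summed row per string; the base table is untouched."""
    conn.execute(
        "CREATE VIEW summarized_data AS "
        "SELECT string AS string_col, SUM(int1) AS int_col1, "
        "SUM(int2) AS int_col2, COUNT(*) AS original_row_count "
        "FROM items GROUP BY string"
    )


def materialize_summary(conn):
    """Copy the view's rows into a new, real table named items_summary."""
    conn.execute(
        "CREATE TABLE items_summary ("
        "id INTEGER PRIMARY KEY, string TEXT, int1 INTEGER, int2 INTEGER)"
    )
    with conn:
        conn.execute(
            "INSERT INTO items_summary (string, int1, int2) "
            "SELECT string_col, int_col1, int_col2 "
            "FROM summarized_data ORDER BY string_col"
        )


conn = make_demo_db()
create_summary_view(conn)
print(conn.execute("SELECT * FROM summarized_data ORDER BY string_col").fetchall())
# prints [('a', 7, 70, 3), ('b', 3, 30, 1)]
materialize_summary(conn)
```

Reading from the view always reflects the current contents of the base table, whereas items_summary is a snapshot taken at the moment of the INSERT.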


 