简体   繁体   English

使用唯一索引删除重复项

[英]Removing duplicates with unique index

I inserted between two tables fields A,B,C,D, believing I had created a Unique Index on A,B,C,D to prevent duplicates.我在两个表字段 A、B、C、D 之间插入,相信我在 A、B、C、D 上创建了一个唯一索引以防止重复。 However I somehow simply made a normal index on those.然而,我以某种方式简单地对这些做了一个正常的索引。 So duplicates got inserted.因此插入了重复项。 It is 20 million record table.它是2000万条记录表。

If I change my existing index from normal to unique or simply a add a new unique index for A,B,C,D will the duplicates be removed or will adding fail since unique records exist?如果我将现有索引从正常更改为唯一索引,或者只是为 A、B、C、D 添加一个新的唯一索引,重复项会被删除还是因为存在唯一记录而添加失败? I'd test it yet it is 30 mil records and I neither wish to mess the table up or duplicate it.我会测试它,但它有 3000 万条记录,我不想弄乱表格或复制它。

If you have duplicates in your table and you use如果您的表中有重复项并且您使用

ALTER TABLE mytable ADD UNIQUE INDEX myindex (A, B, C, D);

the query will fail with Error 1062 (duplicate key).查询将因错误 1062(重复键)而失败。

But if you use IGNORE但是如果你使用IGNORE

-- (only works before MySQL 5.7.4)
ALTER IGNORE TABLE mytable ADD UNIQUE INDEX myindex (A, B, C, D);

the duplicates will be removed.重复项将被删除。 But the documentation doesn't specify which row will be kept:但文档没有指定将保留哪一行:

  • IGNORE is a MySQL extension to standard SQL. IGNORE是标准 SQL 的 MySQL 扩展。 It controls how ALTER TABLE works if there are duplicates on unique keys in the new table or if warnings occur when strict mode is enabled.如果新表中的唯一键存在重复项,或者在启用严格模式时出现警告,它会控制ALTER TABLE工作方式。 If IGNORE is not specified, the copy is aborted and rolled back if duplicate-key errors occur.如果未指定IGNORE ,则在发生重复键错误时中止并回滚副本。 If IGNORE is specified, only one row is used of rows with duplicates on a unique key.如果指定了IGNORE则在唯一键上具有重复项的行中仅使用一行。 The other conflicting rows are deleted.其他冲突行被删除。 Incorrect values are truncated to the closest matching acceptable value.不正确的值被截断为最接近匹配的可接受值。

    As of MySQL 5.7.4, the IGNORE clause for ALTER TABLE is removed and its use produces an error.从 MySQL 5.7.4 开始,ALTER TABLE 的 IGNORE 子句被删除,它的使用会产生错误。

( ALTER TABLE Syntax ) ( ALTER TABLE 语法)

If your version is 5.7.4 or greater - you can:如果您的版本是 5.7.4 或更高版本 - 您可以:

  • Copy the data into a temporary table (it doesn't technically need to be temporary).将数据复制到临时表中(技术上不需要是临时表)。
  • Truncate the original table.截断原始表。
  • Create the UNIQUE INDEX.创建唯一索引。
  • And copy the data back with INSERT IGNORE (which is still available).并使用INSERT IGNORE (仍然可用)将数据复制回来。
CREATE TABLE tmp_data SELECT * FROM mytable;
TRUNCATE TABLE mytable;
ALTER TABLE mytable ADD UNIQUE INDEX myindex (A, B, C, D);
INSERT IGNORE INTO mytable SELECT * from tmp_data;
DROP TABLE tmp_data;

If you use the IGNORE modifier, errors that occur while executing the INSERT statement are ignored.如果使用IGNORE修饰符,则执行INSERT语句时发生的错误将被忽略。 For example, without IGNORE , a row that duplicates an existing UNIQUE index or PRIMARY KEY value in the table causes a duplicate-key error and the statement is aborted.例如,如果没有IGNORE ,复制表中现有UNIQUE索引或PRIMARY KEY值的行会导致重复键错误并且语句被中止。 With IGNORE , the row is discarded and no error occurs.使用IGNORE ,该行被丢弃并且不会发生错误。 Ignored errors generate warnings instead.忽略的错误会生成警告。

(INSERT Syntax) (插入语法)

Also see: INSERT ... SELECT Syntax and Comparison of the IGNORE Keyword and Strict SQL Mode另请参阅: INSERT ... SELECT 语法IGNORE 关键字和 Strict SQL 模式的比较

if you think there will be duplicates, adding the unique index will fail.如果您认为会有重复,则添加唯一索引将失败。 first check what duplicates there are:首先检查有哪些重复项:

select * from
(select a,b,c,d,count(*) as n from table_name group by a,b,c,d) x
where x.n > 1

This may be a expensive query on 20M rows, but will get you all duplicate keys that will prevent you from adding the primary index.这可能是对 20M 行的昂贵查询,但会为您提供所有重复的键,这将阻止您添加主索引。 You could split this up into smaller chunks if you do a where in the subquery: where a='some_value'如果在子查询中执行 where,则可以将其拆分为更小的块: where a='some_value'

For the records retrieved, you will have to change something to make the rows unique.对于检索到的记录,您必须进行一些更改以使行唯一。 If that is done (query returns 0 rows) you should be safe to add the primary index.如果这样做(查询返回 0 行),您应该可以安全地添加主索引。

您可以使用 ON DUPLICATE KEY UPDATE 代替 IGNORE,这将使您能够控制哪些值应该占上风。

To answer your question- adding a UNIQUE constraint on a column that has duplicate values will throw an error.要回答您的问题 - 在具有重复值的列上添加UNIQUE约束将引发错误。

For example, you can try the following script:例如,您可以尝试以下脚本:

CREATE TABLE `USER` (
  `USER_ID` INT NOT NULL,
  `USERNAME` VARCHAR(45) NOT NULL,
  `NAME` VARCHAR(45) NULL,
  PRIMARY KEY (`USER_ID`));

INSERT INTO USER VALUES(1,'apple', 'woz'),(2,'apple', 'jobs'),
(3,'google', 'sergey'),(4,'google', 'larry');

ALTER TABLE `USER` 
ADD UNIQUE INDEX `USERNAME_UNIQUE` (`USERNAME` ASC);
/*
Operation failed: There was an error while applying the SQL script to the database.
ERROR 1062: Duplicate entry 'apple' for key 'USERNAME_UNIQUE'
*/

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM