How to delete duplicates faster?

Question

On a table with about 1.7M rows, I tried to delete duplicate posts:

delete a FROM comment a
  INNER JOIN comment a2
     WHERE a.id < a2.id
     AND   a.body = a2.body;

The result was:

  Query OK, 35071 rows affected (5 hours 36 min 48.79 sec)

This happened on my almost idle workstation with an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz. I'm wondering if there are some tricks to make this delete operation faster?

Answer 1

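For MySQL specifically, in versions before 5.7.4 (where ALTER IGNORE TABLE was removed), you can try adding a unique index on the columns that define a duplicate. Rows that collide on the index are dropped and only the first one is kept. The index has to cover body alone: id is already unique, so an index on (id, body) would not remove anything. If body is a TEXT column, MySQL also requires a prefix length on the indexed column.

ALTER IGNORE TABLE comment ADD UNIQUE INDEX idx_name (body);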

Source

Answer 2

The following query may be useful to you.

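-- SQL Server-style syntax; ColName1..3 are the columns that define a duplicate.
-- The inner FROM was missing in the original; it must name the real table.
DELETE YourTableName
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ColName1, ColName2, ColName3
                              ORDER BY ColName1, ColName2, ColName3 ASC) AS RowNumber
    FROM   YourTable
) YourTableName
WHERE  YourTableName.RowNumber > 1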

If it works, kindly mark it as the answer.
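For anyone who wants to try the row-number idea without a database server, here is a minimal sketch using Python's bundled sqlite3 module. It is not the answer's exact statement: SQLite cannot delete through a derived table, so the numbered rows are fed into an IN list instead. The comment(id, body) table mirrors the question, delete_by_row_number is a hypothetical helper name, and window functions need SQLite 3.25 or newer.

```python
import sqlite3


def delete_by_row_number(rows):
    """Keep the highest id per body and return the surviving rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comment (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO comment (id, body) VALUES (?, ?)", rows)
    # Number the rows inside each body group, newest id first;
    # every row numbered 2 or higher is a duplicate to remove.
    conn.execute(
        """
        DELETE FROM comment
        WHERE id IN (
            SELECT id
            FROM (
                SELECT id,
                       ROW_NUMBER() OVER (PARTITION BY body ORDER BY id DESC) AS rn
                FROM comment
            )
            WHERE rn > 1
        )
        """
    )
    remaining = conn.execute("SELECT id, body FROM comment ORDER BY id").fetchall()
    conn.close()
    return remaining


if __name__ == "__main__":
    sample = [(1, "a"), (2, "a"), (3, "a"), (4, "a"), (5, "b")]
    print(delete_by_row_number(sample))
```

Ordering by id inside the window makes the choice of survivor deterministic, which ordering by the partition columns alone does not.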

Answer 3

Your query is attempting a zillion deletes for the same row. For instance, if you have this data:

body   id
  a     1
  a     2
  a     3
  a     4

Then your query attempts the following deletions:

 c.body   c.id  c2.id
  a         1      4
  a         1      3
  a         1      2
  a         2      4
  a         2      3
  a         3      4

You can see how this would result in lots of work for the database as the number of ids for a given body increases: n rows with the same body produce n(n-1)/2 matching pairs.
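The growth described above can be checked with a small experiment. This is only a sketch: it uses Python's bundled sqlite3 module rather than MySQL, runs the question's join condition as a SELECT COUNT(*) instead of a DELETE, and count_join_pairs is a hypothetical helper name.

```python
import sqlite3


def count_join_pairs(n_duplicates):
    """Count the rows the self-join produces for one body repeated n times."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comment (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO comment (id, body) VALUES (?, ?)",
        [(i, "a") for i in range(1, n_duplicates + 1)],
    )
    # Same join condition as the slow DELETE, written as a SELECT.
    (pairs,) = conn.execute(
        """
        SELECT COUNT(*)
        FROM comment c
        JOIN comment c2
          ON c.id < c2.id AND c.body = c2.body
        """
    ).fetchone()
    conn.close()
    return pairs


if __name__ == "__main__":
    for n in (4, 100, 1000):
        print(n, count_join_pairs(n))
```

Four duplicates give the six pairs listed in the table; a thousand duplicates of a single body already give 499500.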

You can fix this by using group by instead:

delete c 
    from comment c join
         (select c2.body, max(c2.id) as max_id
          from comment c2
          group by c2.body
         ) c2
         on c2.body = c.body and c.id < c2.max_id;

In addition, you want an index on comment(body, id).
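The group-by approach can be tried on a toy table as follows. This is a sketch under assumptions, not the MySQL statement itself: it runs on SQLite through Python's sqlite3 module, and since SQLite has no multi-table DELETE, the grouped subquery is used in a NOT IN list. The index name idx_comment_body_id and the helper dedupe_keep_latest are made up for illustration.

```python
import sqlite3


def dedupe_keep_latest(rows):
    """Delete every row except the highest id per body.

    Returns (number of rows deleted, surviving rows).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comment (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO comment (id, body) VALUES (?, ?)", rows)
    # Index suggested in the answer: lets the engine find MAX(id) per body cheaply.
    conn.execute("CREATE INDEX idx_comment_body_id ON comment (body, id)")
    # One grouped pass finds the survivor of each body; everything else goes.
    cur = conn.execute(
        """
        DELETE FROM comment
        WHERE id NOT IN (SELECT MAX(id) FROM comment GROUP BY body)
        """
    )
    deleted = cur.rowcount
    remaining = conn.execute("SELECT id, body FROM comment ORDER BY id").fetchall()
    conn.close()
    return deleted, remaining


if __name__ == "__main__":
    sample = [(1, "a"), (2, "a"), (3, "a"), (4, "a"), (5, "b")]
    print(dedupe_keep_latest(sample))
```

Each duplicate is matched once against a single max_id value, so the work grows with the number of rows rather than with the number of pairs.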

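You might also find that an anti-join works better than the join you are attempting. Note that the anti-join matches the rows to keep (those with no higher id for the same body), not the rows to delete, so use it to copy the survivors into a new table and then swap the tables, rather than in the delete itself:

create table comment_dedup like comment;

insert into comment_dedup
    select c.*
    from comment c left join
         comment c2
         on c2.body = c.body and c2.id > c.id
    where c2.id is null;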
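Before running any anti-join against real data, it is worth checking which rows the condition actually matches. The sketch below does that on a toy table with Python's sqlite3 module (SQLite standing in for MySQL); rows_matched_by_anti_join is a hypothetical helper name.

```python
import sqlite3


def rows_matched_by_anti_join(rows):
    """Return the rows for which no higher id with the same body exists."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comment (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO comment (id, body) VALUES (?, ?)", rows)
    # LEFT JOIN plus IS NULL keeps only rows that found no partner c2.
    matched = conn.execute(
        """
        SELECT c.id, c.body
        FROM comment c
        LEFT JOIN comment c2
          ON c2.body = c.body AND c2.id > c.id
        WHERE c2.id IS NULL
        ORDER BY c.id
        """
    ).fetchall()
    conn.close()
    return matched


if __name__ == "__main__":
    sample = [(1, "a"), (2, "a"), (3, "a"), (4, "a"), (5, "b")]
    print(rows_matched_by_anti_join(sample))
```

On this sample the condition matches ids 4 and 5, that is, the newest row of each body and every row that has no duplicate at all. Those are the rows one would normally want to keep.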

3 answers

Answer 1
0 2019-06-07 10:01:09

Answer 2
0 2019-06-07 10:20:38

Answer 3
0 2019-06-07 10:32:34
