
Estimate/speed up a huge table self-join on MySQL

I have a huge table:

 CREATE TABLE `messageline` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `hash` bigint(20) DEFAULT NULL,
  `quoteLevel` int(11) DEFAULT NULL,
  `messageDetails_id` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `FK2F5B707BF7C835B8` (`messageDetails_id`),
  KEY `hash_idx` (`hash`),
  KEY `quote_level_idx` (`quoteLevel`),
  CONSTRAINT `FK2F5B707BF7C835B8` FOREIGN KEY (`messageDetails_id`) REFERENCES `messagedetails` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=401798068 DEFAULT CHARSET=utf8 COLLATE=utf8_bin

I need to find duplicate lines this way:

create table foundline AS
select ml.messagedetails_id, ml.hash, ml.quotelevel
from messageline ml,
     messageline ml1
where ml1.hash = ml.hash
  and ml1.messagedetails_id != ml.messagedetails_id;

But this query has been running for more than a day already, which is too long; a few hours would be acceptable. How can I speed it up? Thanks.

EXPLAIN output:

+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key      | key_len | ref           | rows      | Extra       |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
|  1 | SIMPLE      | ml    | ALL  | hash_idx      | NULL     | NULL    | NULL          | 401798409 |             |
|  1 | SIMPLE      | ml1   | ref  | hash_idx      | hash_idx | 9       | skryb.ml.hash |         1 | Using where |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+

You can find your duplicates like this:

SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
GROUP BY messagedetails_id HAVING c > 1;

If it is still too slow, add a condition to split the query on an indexed field:

WHERE messagedetails_id < 100000
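
For example, a batched version of the query would look like the sketch below (the 0 and 100000 boundaries are illustrative; pick ranges that match your id distribution and run one query per range):

SELECT messagedetails_id, COUNT(*) c
FROM messageline
WHERE messagedetails_id >= 0
  AND messagedetails_id < 100000
GROUP BY messagedetails_id HAVING c > 1;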

Is it required to do this solely with SQL? For that many records you would be better off breaking this down into two steps:

  1. First run the following query:

     CREATE TABLE duplicate_hashes
     SELECT * FROM (
         SELECT hash,
                GROUP_CONCAT(id) AS ids,
                COUNT(*) AS cnt,
                COUNT(DISTINCT messagedetails_id) AS cnt_message_details,
                GROUP_CONCAT(DISTINCT messagedetails_id) AS messagedetails_ids
         FROM messageline
         GROUP BY hash
         HAVING cnt > 1
         ORDER BY NULL
     ) tmp
     WHERE cnt > cnt_message_details;

     This will give you the duplicate ids for each hash, and since you have an index on the hash field, the grouping will be relatively fast. By counting the distinct messagedetails_id values and comparing, you implicitly fulfill the requirement for different messagedetails_id:

     where ml1.hash = ml.hash and ml1.messagedetails_id != ml.messagedetails_id

  2. Use a script to check each record of the duplicate_hashes table; a sketch of the idea follows below.
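
Step 2 can be done in any scripting language by walking duplicate_hashes and splitting the GROUP_CONCAT id lists. If you would rather stay in SQL, here is a minimal sketch (assuming the duplicate_hashes table created in step 1) that materializes the foundline rows the original query was building:

-- index the new table so the join below can use it
ALTER TABLE duplicate_hashes ADD INDEX (hash);

-- keep only lines whose hash is known to be duplicated
CREATE TABLE foundline AS
SELECT ml.messagedetails_id, ml.hash, ml.quotelevel
FROM messageline ml
JOIN duplicate_hashes dh ON dh.hash = ml.hash;

Because duplicate_hashes only holds hashes that are known to be duplicated, this join touches far fewer rows than the original self-join against all 400M records.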
