
Estimate/speed up a huge table self-join on MySQL

I have a huge table:

 CREATE TABLE `messageline` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `hash` bigint(20) DEFAULT NULL,
  `quoteLevel` int(11) DEFAULT NULL,
  `messageDetails_id` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `FK2F5B707BF7C835B8` (`messageDetails_id`),
  KEY `hash_idx` (`hash`),
  KEY `quote_level_idx` (`quoteLevel`),
  CONSTRAINT `FK2F5B707BF7C835B8` FOREIGN KEY (`messageDetails_id`) REFERENCES `messagedetails` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=401798068 DEFAULT CHARSET=utf8 COLLATE=utf8_bin

I need to find duplicate lines this way:

create table foundline AS
select ml.messagedetails_id, ml.hash, ml.quotelevel
from messageline ml,
     messageline ml1
where ml1.hash = ml.hash
  and ml1.messagedetails_id != ml.messagedetails_id;

But this query has been running for more than a day already, which is too long; a few hours would be acceptable. How can I speed it up? Thanks.

EXPLAIN output:

+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key      | key_len | ref           | rows      | Extra       |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
|  1 | SIMPLE      | ml    | ALL  | hash_idx      | NULL     | NULL    | NULL          | 401798409 |             |
|  1 | SIMPLE      | ml1   | ref  | hash_idx      | hash_idx | 9       | skryb.ml.hash |         1 | Using where |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+

You can find your duplicates like this:

SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
GROUP BY messagedetails_id HAVING c > 1;

If it is still too slow, add a condition to split the query on an indexed field:

WHERE messagedetails_id < 100000
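
For example, a batched version of the query would look like the sketch below (the 0 and 100000 boundaries are illustrative; pick ranges that match your id distribution and run one query per range):

SELECT messagedetails_id, COUNT(*) c
FROM messageline
WHERE messagedetails_id >= 0
  AND messagedetails_id < 100000
GROUP BY messagedetails_id HAVING c > 1;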

Is it required to do this solely with SQL? For that many records you would be better off breaking this down into two steps:

  1. First run the following query:

     CREATE TABLE duplicate_hashes
     SELECT * FROM (
         SELECT hash,
                GROUP_CONCAT(id) AS ids,
                COUNT(*) AS cnt,
                COUNT(DISTINCT messagedetails_id) AS cnt_message_details,
                GROUP_CONCAT(DISTINCT messagedetails_id) AS messagedetails_ids
         FROM messageline
         GROUP BY hash
         HAVING cnt > 1
         ORDER BY NULL
     ) tmp
     WHERE cnt > cnt_message_details;

     This will give you the duplicate ids for each hash, and since you have an index on the hash field, the grouping will be relatively fast. By counting the distinct messagedetails_id values and comparing, you implicitly fulfill the requirement for different messagedetails_id:

     where ml1.hash = ml.hash and ml1.messagedetails_id != ml.messagedetails_id

  2. Use a script to check each record of the duplicate_hashes table; a sketch of the idea follows below.
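
Step 2 can be done in any scripting language by walking duplicate_hashes and splitting the GROUP_CONCAT id lists. If you would rather stay in SQL, here is a minimal sketch (assuming the duplicate_hashes table created in step 1) that materializes the foundline rows the original query was building:

-- index the new table so the join below can use it
ALTER TABLE duplicate_hashes ADD INDEX (hash);

-- keep only lines whose hash is known to be duplicated
CREATE TABLE foundline AS
SELECT ml.messagedetails_id, ml.hash, ml.quotelevel
FROM messageline ml
JOIN duplicate_hashes dh ON dh.hash = ml.hash;

Because duplicate_hashes only holds hashes that are known to be duplicated, this join touches far fewer rows than the original self-join against all 400M records.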
