简体   繁体   中英

Optimizing Self-Join using filesort

I have this query:

SELECT DISTINCT 
       t1.`signature_id` AS id1, 
       t2.`signature_id` AS id2, 
       COUNT(DISTINCT t3.serial) AS weight 
FROM `gc_con_sig` AS t1 
INNER JOIN `gc_con_sig` AS t2 
        ON ((t1.`signature_id` != t2.`signature_id`) 
            AND (t1.`petition_id` = t2.`petition_id`))
INNER JOIN `wtp_data_petitions` AS t3 
        ON (t3.`serial` = t1.`petition_serial`)
GROUP BY t1.`signature_id`, t2.`signature_id`
HAVING weight > 0;

It essentially get the permutations of signature_ids, and the number of petitions they've both signed (weight).

That I'm trying to run against this table (gc_con_sig):

`petition_id` varchar(64) NOT NULL DEFAULT '' COMMENT 'Petition ID defined by API',
  `signature_id` varchar(34) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `petition_serial` int(11) DEFAULT NULL,
  KEY `signature_id` (`signature_id`),
  KEY `petition_id` (`petition_id`),
  KEY `signature_petition_idx` (`signature_id`,`petition_id`),
  KEY `pcidx` (`petition_id`,`signature_id`),
  KEY `sig_pet_ser_idx` (`petition_serial`)

This is the explain I get:

  +----+-------------+-------+--------+--------------------------------------------------------+---------+---------+------------------------+--------+----------------------------------------------+
  | id | select_type | table | type   | possible_keys                                          | key     | key_len | ref                    | rows   | Extra                                        |
  +----+-------------+-------+--------+--------------------------------------------------------+---------+---------+------------------------+--------+----------------------------------------------+
  |  1 | SIMPLE      | t1    | ALL    | petition_id,pcidx,sig_pet_ser_idx                      | NULL    | NULL    | NULL                   | 200659 | Using where; Using temporary; Using filesort |
  |  1 | SIMPLE      | t3    | eq_ref | PRIMARY                                                | PRIMARY | 4       | wtp.t1.petition_serial |      1 | Using index                                  |
  |  1 | SIMPLE      | t2    | ref    | petition_id,pcidx                                      | pcidx   | 194     | wtp.t1.petition_id     |   5016 | Using where; Using index                     |
  +----+-------------+-------+--------+--------------------------------------------------------+---------+---------+------------------------+--------+----------------------------------------------+

I've optimized the various mysql configurations mysqltuner has told me to, but this query doesn't run (at least within an hour) on a machine with 17GB ram (12GB allocated to mysql).

Any ideas?

Can signatures be on multiple petitions? Can serial be NULL ?

Assuming the answers are "no" to both questions, you might try:

SELECT t1.`signature_id` AS id1, t2.`signature_id` AS id2,
       COUNT(*) AS weight 
FROM `gc_con_sig` t1 INNER JOIN
     `gc_con_sig` t2
     ON (t1.`signature_id` != t2.`signature_id`) AND
        (t1.`petition_id` = t2.`petition_id`)
GROUP BY t1.`signature_id`, t2.`signature_id`;

The count(distinct serial) is counting the non-NULL values in the field. If all values are not NULL and there are no duplicates, then this is equivalent to count(*) .

The having clause is not needed because the on clause basically guarantees that there is at least one match.

And, finally, select distinct is never needed when you are using a group by correctly.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM