简体   繁体   中英

MySQL WHERE EXISTS subquery performance between 5.7 and 8.0

I have an animals table (~2.7m records) and a breeds table (~2.7m records) that have a one to many relationship (one animal can have multiple breeds). I'm trying to query all distinct breeds for a specific species. As I'm not a SQL expert, my initial thought was to go with a simple SELECT DISTINCT breed... JOIN , but this query took about 10 seconds which seemed much longer than I'd expect. So I changed this to a SELECT DISTINCT... WHERE EXISTS subquery and it executed in about 100ms in 5.7, which is much more reasonable. But now I'm migrating to MySQL 8 and this exact query takes anywhere from 10-30 seconds. Here are the table definitions:

CREATE TABLE `animals` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(150) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  `species` varchar(50) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
  `sex` enum('Male','Female') CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
  `dob` date DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `animals_name_index` (`name`),
  KEY `animals_dob_index` (`dob`),
  KEY `animals_sex_index` (`sex`),
  KEY `animals_species_index` (`species`,`id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=2807152 DEFAULT CHARSET=utf8mb3 COLLATE=utf8_unicode_ci
CREATE TABLE `animal_breeds` (
  `animal_id` int unsigned DEFAULT NULL,
  `breed` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  UNIQUE KEY `animal_breeds_animal_id_breed_unique` (`animal_id`,`breed`),
  KEY `animal_breeds_breed_animal_id_index` (`breed`,`animal_id`) USING BTREE,
  CONSTRAINT `animal_breeds_animal_id_foreign` FOREIGN KEY (`animal_id`) REFERENCES `animals` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Here is the query I'm running:

SELECT SQL_NO_CACHE * 
FROM
    (
        SELECT DISTINCT `breed` 
        FROM `animal_breeds`
    ) AS `subQuery`
WHERE
    EXISTS (
        SELECT `breed`
        FROM `animal_breeds`
        INNER JOIN `animals` ON `animals`.`id` = `animal_breeds`.`animal_id`
        WHERE `animals`.`species` = 'Dog'AND `animal_breeds`.`breed` = `subQuery`.`breed`
    );

Here are the two EXPLAIN statements from 5.7 and 8.0

MySQL 5.7

284 rows in set, 1 warning (0.02 sec)

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> NULL ALL NULL NULL NULL NULL 7775 100.00 Using where
3 DEPENDENT SUBQUERY animal_breeds NULL ref animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index animal_breeds_breed_animal_id_index 1022 allBreeds.breed 348 100.00 Using where; Using index
3 DEPENDENT SUBQUERY animals NULL eq_ref PRIMARY,animals_species_index PRIMARY 4 animal_breeds.animal_id 1 50.00 Using where
2 DERIVED animal_breeds NULL range animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index animal_breeds_breed_animal_id_index 1022 NULL 7775 100.00 Using index for group-by

MySQL 8.0.27

284 rows in set, 1 warning (27.92 sec)

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> NULL ALL NULL NULL NULL NULL 7776 100.00 NULL
1 PRIMARY <subquery3> NULL eq_ref <auto_distinct_key> <auto_distinct_key> 1022 allBreeds.breed 1 100.00 NULL
3 MATERIALIZED animals NULL ref PRIMARY,animals_species_index animals_species_index 153 const 1390666 100.00 Using index
3 MATERIALIZED animal_breeds NULL ref animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index animal_breeds_animal_id_breed_unique 5 animals.id 1 100.00 Using index
2 DERIVED animal_breeds NULL range animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index animal_breeds_breed_animal_id_index 1022 NULL 7776 100.00 Using index for group-by

Lastly, both of these databases are using the base docker image with no changes to the configuration. Although the query still runs poorly on an VPS running MySQL 8 with some tweaked settings. I also read through a thread about someone having a similar problem but the comments/answer didn't seem to help in my case.

Any help would be much appreciated!

EDIT:

Here is the execution plan for the SELECT DISTINCT... JOIN :

SELECT DISTINCT ab.breed
FROM animal_breeds ab
INNER JOIN animals a on a.id=ab.animal_id
WHERE a.species='Dog'

MySQL 5.7

284 rows in set (25.27 sec)

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE a NULL ref PRIMARY,animals_species_index,id_species animals_species_index 153 const 1385271 100.00 Using index; Using temporary
1 SIMPLE ab NULL ref animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index animal_breeds_animal_id_breed_unique 5 a.id 1 100.00 Using index

MySQL 8.0

284 rows in set (29.45 sec)

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE a NULL ref PRIMARY,animals_species_index,id_species animals_species_index 153 const 1390666 100.00 Using index; Using temporary
1 SIMPLE ab NULL ref animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index animal_breeds_animal_id_breed_unique 5 a.id 1 100.00 Using index
SELECT ab.breed
FROM animal_breeds ab
INNER JOIN animals a on a.id=ab.animal_id
WHERE a.species='Dog'

MySQL 5.7

2722722 rows in set (26.69 sec)

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE a ref PRIMARY,animals_species_index,id_species animals_species_index 153 const 1385271 100.00 Using index
1 SIMPLE ab ref animal_breeds_animal_id_breed_unique animal_breeds_animal_id_breed_unique 5 a.id 1 100.00 Using index

MySQL 8.0

2722722 rows in set (32.49 sec)

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE a NULL ref PRIMARY,animals_species_index,id_species animals_species_index 153 const 1390666 100.00 Using index
1 SIMPLE ab NULL ref animal_breeds_animal_id_breed_unique animal_breeds_animal_id_breed_unique 5 a.id 1 100.00 Using index

Try to write query without inner join and start with a table that contains columns from where conditions. Here is the one of possible variants:

SELECT DISTINCT ab.breed
FROM animals a
LEFT JOIN animal_breeds ab on a.id = ab.animal_id
WHERE a.species = 'Dog'

Consider this:

  • Change the table name animals to pets . (I think of "Fido" as a pet and "dog" as an animal. Because of this, it took me a long time to figure out the schema.)
  • Move species to the other table, now called pet_species . Though I cannot imagine a "cat" breed being called "retriever", it does make sense that (species, breed) is a hierarchical pair of terms.
  • Another confusion: Technically speaking a "dog" is "Canis familiaris", which is [technically] two terms 'genus' and 'species'.

Moving species to the other table will lead to some changes to the queries. You could have it in both tables, though DB purists say that redundant information is a "no-no". I have not thought of a compromise between the two stands.

Filtering animals before joining it to breeds will improve performance (10x faster in some cases):

SELECT  DISTINCT ab.breed
FROM    animal_breeds ab
WHERE   ab.animal_id IN (
                     SELECT a.id
                     FROM animals a
                     WHERE a.species = 'Dog');

If a breed is only ever associated with one species, you can greatly speed up this query by finding all the unique breeds in the animal_breeds table by using GROUP BY and selecting (for example) the minimum animal_id for each breed (with your index this is very efficient). This result can then be joined to the animals table on that animal_id value and filtered based on the animal species being Dog :

WITH breeds AS (
    SELECT breed, MIN(animal_id) AS animal_id
    FROM animal_breeds
    GROUP BY breed
)
SELECT breed
FROM breeds
JOIN animals ON animals.id = breeds.animal_id AND animals.species = 'Dog'

I tested this on my MySQL 8 server with a dataset of 1M rows with 3 species and 5 breeds per species and it ran at least 200x faster than any of the other proposed solutions (~15ms vs 3sec).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM