I have an animals
table (~2.7m records) and a breeds
table (~2.7m records) that have a one to many relationship (one animal can have multiple breeds). I'm trying to query all distinct breeds for a specific species. As I'm not a SQL expert, my initial thought was to go with a simple SELECT DISTINCT breed... JOIN
, but this query took about 10 seconds which seemed much longer than I'd expect. So I changed this to a SELECT DISTINCT... WHERE EXISTS
subquery and it executed in about 100ms in 5.7, which is much more reasonable. But now I'm migrating to MySQL 8 and this exact query takes anywhere from 10-30 seconds. Here are the table definitions:
CREATE TABLE `animals` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(150) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`species` varchar(50) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`sex` enum('Male','Female') CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`dob` date DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `animals_name_index` (`name`),
KEY `animals_dob_index` (`dob`),
KEY `animals_sex_index` (`sex`),
KEY `animals_species_index` (`species`,`id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=2807152 DEFAULT CHARSET=utf8mb3 COLLATE=utf8_unicode_ci
CREATE TABLE `animal_breeds` (
`animal_id` int unsigned DEFAULT NULL,
`breed` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
UNIQUE KEY `animal_breeds_animal_id_breed_unique` (`animal_id`,`breed`),
KEY `animal_breeds_breed_animal_id_index` (`breed`,`animal_id`) USING BTREE,
CONSTRAINT `animal_breeds_animal_id_foreign` FOREIGN KEY (`animal_id`) REFERENCES `animals` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Here is the query I'm running:
SELECT SQL_NO_CACHE *
FROM
(
SELECT DISTINCT `breed`
FROM `animal_breeds`
) AS `subQuery`
WHERE
EXISTS (
SELECT `breed`
FROM `animal_breeds`
INNER JOIN `animals` ON `animals`.`id` = `animal_breeds`.`animal_id`
WHERE `animals`.`species` = 'Dog'AND `animal_breeds`.`breed` = `subQuery`.`breed`
);
Here are the two EXPLAIN
statements from 5.7 and 8.0
MySQL 5.7
284 rows in set, 1 warning (0.02 sec)
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 7775 | 100.00 | Using where |
3 | DEPENDENT SUBQUERY | animal_breeds | NULL | ref | animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index | animal_breeds_breed_animal_id_index | 1022 | allBreeds.breed | 348 | 100.00 | Using where; Using index |
3 | DEPENDENT SUBQUERY | animals | NULL | eq_ref | PRIMARY,animals_species_index | PRIMARY | 4 | animal_breeds.animal_id | 1 | 50.00 | Using where |
2 | DERIVED | animal_breeds | NULL | range | animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index | animal_breeds_breed_animal_id_index | 1022 | NULL | 7775 | 100.00 | Using index for group-by |
MySQL 8.0.27
284 rows in set, 1 warning (27.92 sec)
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 7776 | 100.00 | NULL |
1 | PRIMARY | <subquery3> | NULL | eq_ref | <auto_distinct_key> | <auto_distinct_key> | 1022 | allBreeds.breed | 1 | 100.00 | NULL |
3 | MATERIALIZED | animals | NULL | ref | PRIMARY,animals_species_index | animals_species_index | 153 | const | 1390666 | 100.00 | Using index |
3 | MATERIALIZED | animal_breeds | NULL | ref | animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index | animal_breeds_animal_id_breed_unique | 5 | animals.id | 1 | 100.00 | Using index |
2 | DERIVED | animal_breeds | NULL | range | animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index | animal_breeds_breed_animal_id_index | 1022 | NULL | 7776 | 100.00 | Using index for group-by |
Lastly, both of these databases are using the base docker image with no changes to the configuration. Although the query still runs poorly on an VPS running MySQL 8 with some tweaked settings. I also read through a thread about someone having a similar problem but the comments/answer didn't seem to help in my case.
Any help would be much appreciated!
EDIT:
Here is the execution plan for the SELECT DISTINCT... JOIN
:
SELECT DISTINCT ab.breed
FROM animal_breeds ab
INNER JOIN animals a on a.id=ab.animal_id
WHERE a.species='Dog'
MySQL 5.7
284 rows in set (25.27 sec)
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | a | NULL | ref | PRIMARY,animals_species_index,id_species | animals_species_index | 153 | const | 1385271 | 100.00 | Using index; Using temporary |
1 | SIMPLE | ab | NULL | ref | animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index | animal_breeds_animal_id_breed_unique | 5 | a.id | 1 | 100.00 | Using index |
MySQL 8.0
284 rows in set (29.45 sec)
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | a | NULL | ref | PRIMARY,animals_species_index,id_species | animals_species_index | 153 | const | 1390666 | 100.00 | Using index; Using temporary |
1 | SIMPLE | ab | NULL | ref | animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index | animal_breeds_animal_id_breed_unique | 5 | a.id | 1 | 100.00 | Using index |
SELECT ab.breed
FROM animal_breeds ab
INNER JOIN animals a on a.id=ab.animal_id
WHERE a.species='Dog'
MySQL 5.7
2722722 rows in set (26.69 sec)
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | a | ref | PRIMARY,animals_species_index,id_species | animals_species_index | 153 | const | 1385271 | 100.00 | Using index | |
1 | SIMPLE | ab | ref | animal_breeds_animal_id_breed_unique | animal_breeds_animal_id_breed_unique | 5 | a.id | 1 | 100.00 | Using index |
MySQL 8.0
2722722 rows in set (32.49 sec)
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | a | NULL | ref | PRIMARY,animals_species_index,id_species | animals_species_index | 153 | const | 1390666 | 100.00 | Using index |
1 | SIMPLE | ab | NULL | ref | animal_breeds_animal_id_breed_unique | animal_breeds_animal_id_breed_unique | 5 | a.id | 1 | 100.00 | Using index |
Try to write query without inner join and start with a table that contains columns from where conditions. Here is the one of possible variants:
SELECT DISTINCT ab.breed
FROM animals a
LEFT JOIN animal_breeds ab on a.id = ab.animal_id
WHERE a.species = 'Dog'
Consider this:
animals
to pets
. (I think of "Fido" as a pet and "dog" as an animal. Because of this, it took me a long time to figure out the schema.) species
to the other table, now called pet_species
. Though I cannot imagine a "cat" breed being called "retriever", it does make sense that (species, breed) is a hierarchical pair of terms.Moving species to the other table will lead to some changes to the queries. You could have it in both tables, though DB purists say that redundant information is a "no-no". I have not thought of a compromise between the two stands.
Filtering animals
before joining it to breeds
will improve performance (10x faster in some cases):
SELECT DISTINCT ab.breed
FROM animal_breeds ab
WHERE ab.animal_id IN (
SELECT a.id
FROM animals a
WHERE a.species = 'Dog');
If a breed is only ever associated with one species, you can greatly speed up this query by finding all the unique breeds in the animal_breeds
table by using GROUP BY
and selecting (for example) the minimum animal_id
for each breed (with your index this is very efficient). This result can then be joined to the animals
table on that animal_id
value and filtered based on the animal species being Dog
:
WITH breeds AS (
SELECT breed, MIN(animal_id) AS animal_id
FROM animal_breeds
GROUP BY breed
)
SELECT breed
FROM breeds
JOIN animals ON animals.id = breeds.animal_id AND animals.species = 'Dog'
I tested this on my MySQL 8 server with a dataset of 1M rows with 3 species and 5 breeds per species and it ran at least 200x faster than any of the other proposed solutions (~15ms vs 3sec).
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.