
Optimizing Large MySQL Query

I'm trying to optimize a query that takes way too long to run as it is. It seems to be stuck in Sending Data a lot and takes about half an hour to run.

$campaignIDs = "31,36,37,40,41,42,43,50,51,62,64,65,66,67,68,69,84,338,339,355,431,505,530,549,563,694,752,754,755,760,769,772,777,798,799,800,806,816,821,855,856,945,989,1007,1030,1032,1047,1052,1054,1066,1182,1268,1281,1298,1301,1317,1348,1447,1461,1471,1589,1602,1604,1615,1622,1650,1652,1709";

SELECT Email, Type, CampaignID 
FROM Refer 
WHERE (Type = 'V' OR Type = 'C') 
  AND (EmailDomain = 'yahoo.com') 
  AND (ListID = 1) 
  AND CampaignID IN ($campaignIDs) 
  AND Date >= DATE_SUB(NOW(), INTERVAL 90 DAY)

Here's what the Refer table looks like:

+-------------+------------------+------+-----+-------------------+----------------+
| Field       | Type             | Null | Key | Default           | Extra          |
+-------------+------------------+------+-----+-------------------+----------------+
| ID          | int(10) unsigned | NO   | PRI | NULL              | auto_increment |
| CampaignID  | int(10) unsigned | NO   | MUL | NULL              |                |
| Type        | char(1)          | NO   | MUL | NULL              |                |
| Date        | timestamp        | NO   |     | CURRENT_TIMESTAMP |                |
| IP          | varchar(16)      | NO   |     | NULL              |                |
| Useragent   | varchar(200)     | YES  |     | NULL              |                |
| Referrer    | varchar(200)     | YES  |     | NULL              |                |
| Email       | varchar(200)     | NO   | MUL | NULL              |                |
| EmailDomain | varchar(200)     | YES  | MUL | NULL              |                |
| FolderName  | varchar(200)     | NO   |     | NULL              |                |
| ListID      | int(10) unsigned | NO   | MUL | 1                 |                |
+-------------+------------------+------+-----+-------------------+----------------+

Here are the indexes:

+-------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name       | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| refer |          0 | PRIMARY        |            1 | ID          | A         |   148581841 |     NULL | NULL   |      | BTREE      |         |
| refer |          1 | id_email       |            1 | Email       | A         |    18572730 |     NULL | NULL   |      | BTREE      |         |
| refer |          1 | id_type        |            1 | Type        | A         |          19 |     NULL | NULL   |      | BTREE      |         |
| refer |          1 | id_emaildomain |            1 | EmailDomain | A         |          19 |     NULL | NULL   | YES  | BTREE      |         |
| refer |          1 | id_campaignid  |            1 | CampaignID  | A         |          19 |     NULL | NULL   |      | BTREE      |         |
| refer |          1 | id_listid      |            1 | ListID      | A         |          19 |     NULL | NULL   |      | BTREE      |         |
| refer |          1 | id_emailtype   |            1 | Email       | A         |    24763640 |     NULL | NULL   |      | BTREE      |         |
| refer |          1 | id_emailtype   |            2 | Type        | A         |    37145460 |     NULL | NULL   |      | BTREE      |         |
| refer |          1 | idx_cidtype    |            1 | CampaignID  | A         |          19 |     NULL | NULL   |      | BTREE      |         |
| refer |          1 | idx_cidtype    |            2 | Type        | A         |          19 |     NULL | NULL   |      | BTREE      |         |
+-------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

Here's the output for EXPLAIN SELECT:

+----+-------------+-------+-------+------------------------------------------------------------+---------------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys                                              | key           | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+------------------------------------------------------------+---------------+---------+------+---------+-------------+
|  1 | SIMPLE      | Refer | range | id_type,id_emaildomain,id_campaignid,id_listid,idx_cidtype | id_campaignid | 4       | NULL | 3605121 | Using where |
+----+-------------+-------+-------+------------------------------------------------------------+---------------+---------+------+---------+-------------+

There are about 150M rows in the table.

Is there anything I can do to optimize the query in question? Do I need to add indexes or something? How can I make things better?

You could try the following index to tune that statement

ALTER TABLE refer
  ADD INDEX so_suggested (EmailDomain, ListID, Date);

This is just my first thought.

You can also add CampaignID and Type to make it more efficient--if they are selective. If you add both, you could even try adding Email to make it a covering index, so that the query can be answered from the index alone without touching the table rows.
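
As a rough sketch of that extended index, assuming CampaignID and Type turn out to be selective: the name so_covering is made up for illustration, and depending on storage engine and character set the two varchar(200) columns may run into index key length limits, so treat this as something to test rather than a final definition.

```sql
-- Hypothetical index name. Equality columns first, then the range column (Date),
-- then the columns that only act as filters or are read straight from the index.
ALTER TABLE refer
  ADD INDEX so_covering (EmailDomain, ListID, Date, CampaignID, Type, Email);
```

If EXPLAIN then shows "Using index" in the Extra column, the query is being answered from the index alone.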

However, the number of indexes on that table is rather high (eight). Two of them are redundant (id_email, id_campaignid) because there are other ones that start with the same column (id_emailtype, idx_cidtype).
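
Dropping the two redundant indexes could look like the statement below. It is only a sketch: check first that nothing refers to them by name (for example in a FORCE INDEX hint), and expect the ALTER to take a while on a table of this size.

```sql
-- id_email (Email) is a leading prefix of id_emailtype (Email, Type);
-- id_campaignid (CampaignID) is a leading prefix of idx_cidtype (CampaignID, Type).
ALTER TABLE refer
  DROP INDEX id_email,
  DROP INDEX id_campaignid;
```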

Please note that (in principle) one table access uses only one index. Your query has only one table access (no sub-queries, joins, UNION or so) therefore it can use one index only. Hence, you need one index that supports as much as possible from your where clause.

Please note also that the order of columns in that index matters a lot. I have added the ones with exact match first (EmailDomain, ListID), followed by the one that uses an inequality operator (Date), assuming that the clause on Date is still rather selective. Everything that follows the inequality operation is just a filter in the index; if needed, you can add the columns from the IN lists here.
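
To check whether the optimizer actually picks the new index, re-running EXPLAIN is the quickest test. The statement below is a sketch that reuses the query from the question with a shortened IN list (the three IDs are simply the first ones from the original list; substitute the full list for a like-for-like comparison).

```sql
EXPLAIN
SELECT Email, Type, CampaignID
FROM Refer
WHERE (Type = 'V' OR Type = 'C')
  AND EmailDomain = 'yahoo.com'
  AND ListID = 1
  AND CampaignID IN (31, 36, 37)  -- shortened list, for illustration only
  AND Date >= DATE_SUB(NOW(), INTERVAL 90 DAY);
```

If the index works as intended, the key column should name so_suggested and the rows estimate should come out well below the 3605121 reported for id_campaignid.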

Ad

Just in case you would like to learn more about database indexing: Have a look at my free eBook on database indexing.

There's little scope here for tuning the query, but you could probably make it go a lot faster by tuning the database schema - the trick is to identify a potential index which is as specific as possible.

For example,

AND Date >= DATE_SUB(NOW(), INTERVAL 90 DAY)

suggests that an index on 'Date' might help - but only if your data is well spread over at least 4 years.
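
One way to find out how the data is spread over time is to compare the overall date range with the number of rows in the last 90 days. This is a sketch; the column aliases are arbitrary, and since there is no index on Date it has to scan the whole table.

```sql
-- Full table scan: best run off-peak on a table with ~150M rows.
SELECT MIN(Date) AS oldest,
       MAX(Date) AS newest,
       COUNT(*)  AS total_rows,
       SUM(Date >= DATE_SUB(NOW(), INTERVAL 90 DAY)) AS rows_last_90_days
FROM Refer;
```

The smaller rows_last_90_days is relative to total_rows, the more an index that includes Date is likely to help.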

In practice and particularly when you only need to target specific queries, compound indexes are a good idea - but the best choice of index depends not only on the size and shape of your data but also the other queries which you run on your database.

Looking at your query:

WHERE (Type = 'V' OR Type = 'C') 
  AND (EmailDomain = 'yahoo.com') 
  AND (ListID = 1) 
  AND CampaignID IN ($campaignIDs) 
  AND Date >= DATE_SUB(NOW(), INTERVAL 90 DAY)

You could simply add an index on (Type, EmailDomain, ListID, CampaignID, Date). However, I suspect that CampaignID and Date have the greatest cardinality and should therefore appear at the front of the index - the index should be ordered on the ratio of the cardinality in the input dataset (the table) to the output of the query. For example, if you routinely ran a query with:

 AND Date >= DATE_SUB(NOW(), INTERVAL 90000 DAY)

Then you're not going to get as much benefit from having Date at the front of the index. Similarly, it looks as though Type has a very limited set of values and should appear later in the index than CampaignId (assuming that you only look at a relatively small number of CampaignIds at any time).
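
Spelled out as DDL, that ordering might look like the following. This is a sketch only: the index name is invented, and the placement of EmailDomain and ListID at the end is a guess, since it depends on how selective they are in your data.

```sql
-- Hypothetical index: the presumed high-cardinality columns first, Type later.
ALTER TABLE Refer
  ADD INDEX idx_cid_date_type (CampaignID, Date, Type, EmailDomain, ListID);
```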

To get an estimate of the cardinality, consider:

 SELECT COUNT(records_of_type)/SUM(records_of_type)
 FROM (SELECT afield, COUNT(*) AS records_of_type
   FROM atable
   GROUP BY afield) AS counts

(high values are more selective and should normally appear at the front of an index).
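
Applied to the columns in the question, a somewhat shorter way to get the same ratio is COUNT(DISTINCT ...) / COUNT(*). This is a sketch with arbitrary aliases; it differs from the grouped version only in that COUNT(DISTINCT) ignores NULLs, and it scans the whole table, so it will be slow on 150M rows.

```sql
-- Ratio of distinct values to total rows, per candidate index column.
SELECT COUNT(DISTINCT CampaignID)  / COUNT(*) AS campaignid_ratio,
       COUNT(DISTINCT Type)        / COUNT(*) AS type_ratio,
       COUNT(DISTINCT EmailDomain) / COUNT(*) AS emaildomain_ratio,
       COUNT(DISTINCT ListID)      / COUNT(*) AS listid_ratio
FROM Refer;
```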

But do bear in mind that you will occasionally see functional dependencies across columns.

Ordering the columns of an index by cardinality does not reduce the number of index nodes the DBMS must visit to satisfy the query, but it should reduce the number of disk I/O operations needed.

However, it's much more important to identify which fields should appear in the index before worrying about their order.

You could try a couple of different approaches for this.

One thing you could try:

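$result = mysql_query("SELECT DATE_SUB(NOW(), INTERVAL 90 DAY) AS date");
$date = mysql_result($result, 0);  // mysql_query() only returns a result resource, so fetch the value

SELECT Email, Type, CampaignID FROM (
  SELECT Email, Type, CampaignID, Date  -- Date must be selected so the outer WHERE can use it
  FROM Refer 
  WHERE (Type = 'V' OR Type = 'C') 
    AND (EmailDomain = 'yahoo.com') 
    AND (ListID = 1) 
  ) AS filtered  -- MySQL requires an alias for a derived table
  WHERE Date >= '$date'
    AND CampaignID IN ($campaignIDs) 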

Add an index on (Type, EmailDomain, ListID) for this query and you should see a significant performance gain. You can also play with the order of the columns in the index (but make sure the query matches). The goal is to run the fast part of your query against the large number of records, and then run the slow part against the much smaller result set.

You might need to create a temporary table to get MySQL to do it; I didn't have to for my test set, however. Note also that I took the function call out of the big slow query and turned it into a constant.
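
If the optimizer does not cooperate, the temporary-table variant could be sketched like this. The table name tmp_refer is made up, and the IN list is shortened for readability.

```sql
-- Step 1: run the cheap equality filters once and keep the result.
CREATE TEMPORARY TABLE tmp_refer AS
SELECT Email, Type, CampaignID, Date
FROM Refer
WHERE Type IN ('V', 'C')
  AND EmailDomain = 'yahoo.com'
  AND ListID = 1;

-- Step 2: apply the remaining filters to the much smaller set.
SELECT Email, Type, CampaignID
FROM tmp_refer
WHERE Date >= DATE_SUB(NOW(), INTERVAL 90 DAY)
  AND CampaignID IN (31, 36, 37);  -- substitute the full list here

-- Step 3: clean up (the table would also disappear when the connection closes).
DROP TEMPORARY TABLE tmp_refer;
```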


 