Very simple AVG() aggregation query on MySQL server takes ridiculously long time

Question

I am using a MySQL server on an Amazon cloud service, with default settings. The table involved, mytable, uses InnoDB and has about 1 billion rows. The query is:

select count(*), avg(`01`) from mytable where `date` = "2017-11-01";

It takes almost 10 minutes to execute. I have an index on date. The EXPLAIN output for this query is:

+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| id | select_type | table         | type | possible_keys | key  | key_len | ref   | rows    | Extra |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
|  1 | SIMPLE      | mytable       | ref  | date          | date | 3       | const | 1411576 | NULL  |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+

The indexes from this table are:

+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table         | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| mytable       |          0 | PRIMARY   |            1 | ESI         | A         |    60398679 |     NULL | NULL   |      | BTREE      |         |               |
| mytable       |          0 | PRIMARY   |            2 | date        | A         |  1026777555 |     NULL | NULL   |      | BTREE      |         |               |
| mytable       |          1 | lse_cd    |            1 | lse_cd      | A         |     1919210 |     NULL | NULL   | YES  | BTREE      |         |               |
| mytable       |          1 | zone      |            1 | zone        | A         |      732366 |     NULL | NULL   | YES  | BTREE      |         |               |
| mytable       |          1 | date      |            1 | date        | A         |    85564796 |     NULL | NULL   |      | BTREE      |         |               |
| mytable       |          1 | ESI_index |            1 | ESI         | A         |     6937686 |     NULL | NULL   |      | BTREE      |         |               |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

If I remove AVG():

select count(*) from mytable where `date` = "2017-11-01";

It takes only 0.15 sec to return the count. The count for this specific query is 692792; the counts are similar for other dates.

I don't have an index on 01. Is that the issue? Why does AVG() take so long to compute? There must be something I didn't do properly.

Any suggestion is appreciated!

Answer 1

To count the number of rows with a specific date, MySQL has to locate that value in the index (which is pretty fast; after all, that is what indexes are made for) and then read the subsequent index entries until it reaches the next date. InnoDB stores the primary key columns in every secondary index entry, so, depending on the datatype of esi, this adds up to reading a few MB of data to count your 700k rows. Reading a few MB does not take much time (and that data might already be cached in the buffer pool, depending on how often you use the index).

To calculate the average for a column that is not included in the index, MySQL will, again, use the index to find all rows for that date (the same as before). But additionally, for every row it finds, it has to read the actual table data for that row, which means using the primary key to locate the row, reading some bytes, and repeating this 700k times. This "random access" is a lot slower than the sequential read in the first case. (It gets worse because InnoDB reads whole pages: "some bytes" really means innodb_page_size (16KB by default), so you may have to read up to 700k * 16KB = 11GB, compared to "some MB" for count(*); and depending on your memory configuration, some of this data might not be cached and has to be read from disk.)
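To put rough numbers on this, the following sketch checks the two server settings mentioned above and repeats the worst-case arithmetic. It assumes the round figure of 700k matching rows and the default 16KB page size; the column alias is made up for illustration.

```sql
-- Page size InnoDB reads per access (16384 bytes by default).
SHOW VARIABLES LIKE 'innodb_page_size';

-- Memory available for caching table and index pages.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Worst case: one full 16KB page read for each of ~700k matching rows,
-- expressed in GiB (KB -> MiB -> GiB).
SELECT 700000 * 16 / 1024 / 1024 AS worst_case_gib;
```

If the buffer pool is much smaller than that worst-case figure, a large share of those page reads will probably go to disk.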

A solution to this is to include all used columns in the index (a "covering index"), eg create an index on date, 01. Then MySQL does not need to access the table itself and can proceed, similar to the first method, by just reading the index. The size of the index will increase a bit, so MySQL will need to read "some more MB" (and perform the avg operation), but it should still be a matter of seconds.
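A minimal sketch of that covering index, using the table and column names from the question. The index name date_01 is hypothetical, and building the index on roughly 1 billion rows will take a while.

```sql
-- Covering index for the query: `date` for the lookup, `01` for the average.
ALTER TABLE mytable ADD INDEX date_01 (`date`, `01`);

-- Re-check the plan. If the new index is chosen and covers the query,
-- the Extra column should show "Using index" (no table access needed).
EXPLAIN
SELECT COUNT(*), AVG(`01`)
FROM mytable
WHERE `date` = '2017-11-01';
```

If EXPLAIN still picks the old single-column date index, it is worth testing the query with FORCE INDEX (date_01) to compare timings.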

In the comments, you mentioned that you need to calculate the average over 24 columns. If you want to calculate the avg for several columns at the same time, you would need a covering index on all of them, eg date, 01, 02, ..., 24 to prevent table access. Be aware that an index that contains all columns requires as much storage space as the table itself (and it will take a long time to create such an index), so it might depend on how important this query is if it is worth those resources.

To avoid the MySQL limit of 16 columns per index, you could split it into two indexes (and two queries). Create eg the indexes date, 01, .., 12 and date, 13, .., 24, then use

select * from (select `date`, avg(`01`), ..., avg(`12`) 
               from mytable where `date` = ...) as part1
cross join    (select avg(`13`), ..., avg(`24`) 
               from mytable where `date` = ...) as part2;

Make sure to document this well, as there is no obvious reason to write the query this way, but it might be worth it.
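The two indexes for this split could look roughly like the sketch below. It assumes the 24 value columns are literally named 01 through 24, which the question does not show in full; the index names are hypothetical. Each index has 13 columns, which stays under the 16-column limit.

```sql
-- Assumption: the value columns are named `01` ... `24`.
-- Adding both indexes in one statement avoids altering the table twice.
ALTER TABLE mytable
  ADD INDEX date_01_12 (`date`, `01`, `02`, `03`, `04`, `05`, `06`,
                        `07`, `08`, `09`, `10`, `11`, `12`),
  ADD INDEX date_13_24 (`date`, `13`, `14`, `15`, `16`, `17`, `18`,
                        `19`, `20`, `21`, `22`, `23`, `24`);
```

Running EXPLAIN on each half of the split query should then show which of the two indexes it uses.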

If you only ever average over a single column, you could add 24 separate indexes (on date, 01; on date, 02; ...). In total they will require even more space, but each query might be a little bit faster (as the indexes are smaller individually). The buffer pool might still favour the full index, though, depending on factors like usage patterns and memory configuration, so you may have to test it.

Since date is part of your primary key, you could also consider changing the primary key to date, esi. If you find the dates by the primary key, you do not need an additional step to access the table data (as you are already reading the table), so the behaviour would be similar to the covering index. But this is a significant change to your table and can affect all other queries (eg those that use esi to locate rows), so it has to be considered carefully.
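As a sketch only, the primary key change could be written as below, using the column names from the SHOW INDEX output in the question. This rebuilds the whole table, so on about 1 billion rows it should be tried on a copy first.

```sql
-- Reorder the clustered index so that rows for one date are stored together.
ALTER TABLE mytable
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (`date`, ESI);

-- Optional follow-up (an assumption, not part of the answer): the secondary
-- index on `date` is now a prefix of the primary key and is likely redundant.
-- ALTER TABLE mytable DROP INDEX `date`;
```

Lookups by ESI alone can no longer use the primary key after this change, but the existing ESI_index listed in the question should still serve them.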

As you mentioned, another option would be to build a summary table where you store precalculated values, especially if you do not add or modify rows for past dates (or can keep them up-to-date with a trigger).
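A summary table along those lines might look like this minimal sketch. The table name, the column names row_count and avg_01, and the DOUBLE type for the stored average are all assumptions; only mytable, date and 01 come from the question.

```sql
-- Hypothetical summary table: one row per date.
CREATE TABLE mytable_daily_summary (
  `date`    DATE            NOT NULL,
  row_count BIGINT UNSIGNED NOT NULL,
  avg_01    DOUBLE          NULL,
  PRIMARY KEY (`date`)
) ENGINE=InnoDB;

-- Fill or refresh one date. This still runs the slow query, but only once
-- per date instead of once per report.
REPLACE INTO mytable_daily_summary (`date`, row_count, avg_01)
SELECT `date`, COUNT(*), AVG(`01`)
FROM mytable
WHERE `date` = '2017-11-01'
GROUP BY `date`;

-- Reports then read a single precalculated row.
SELECT row_count, avg_01
FROM mytable_daily_summary
WHERE `date` = '2017-11-01';
```

Storing the row count next to the average also makes it possible to combine several dates into a weighted average later.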

Answer 2

For MyISAM tables, COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause.

For example:

SELECT COUNT(*) FROM student;

https://dev.mysql.com/doc/refman/5.6/en/group-by-functions.html#function_count

If you add AVG() or something else, you lose this optimization.

2 answers

solution1
3 ACCEPTED 2018-03-21 20:41:00

solution2
0 2018-03-21 05:52:49
