Optimization of 'GROUP BY'-Query, eliminate 'Using where; Using temporary; Using filesort'

Question

I'm faced with a MySQL issue I cannot seem to resolve. To be able to execute a GROUP BY query quickly for reporting purposes, I have already denormalized a couple of tables into the following one (the table is maintained by triggers on the other tables; I have made my peace with that):

DROP TABLE IF EXISTS stats;
CREATE TABLE stats (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `datetime` datetime NOT NULL,
  `datetime_hour` datetime NOT NULL,
  `datetime_day` datetime NOT NULL,
  `step_id` int(11) NOT NULL,
  `check_id` int(11) NOT NULL,
  `probe_id` int(11) NOT NULL,

  `execution_step_id` int(11) NOT NULL,

  `value_of_interest` int(11) DEFAULT NULL,
  `internal` tinyint(1) NOT NULL DEFAULT '0',

  PRIMARY KEY (`id`),
  UNIQUE KEY `index_stats_on_execution_step_id` (`execution_step_id`),

  CONSTRAINT `stats_step_id_fk` FOREIGN KEY (`step_id`) REFERENCES `steps` (`id`) ON DELETE CASCADE,
  CONSTRAINT `stats_check_id_fk` FOREIGN KEY (`check_id`) REFERENCES `checks` (`id`) ON DELETE CASCADE,
  CONSTRAINT `stats_probe_id_fk` FOREIGN KEY (`probe_id`) REFERENCES `probes` (`id`) ON DELETE CASCADE,
  CONSTRAINT `stats_execution_step_id_fk` FOREIGN KEY (`execution_step_id`) REFERENCES `execution_steps` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Whatever indexes I put on the table, however, the EXPLAIN output for the following query still shows Using where; Using temporary; Using filesort, or some combination of them, and the query runs with unacceptable performance:

SELECT
  datetime_day,
  step_id,
  CAST(AVG(value_of_interest) AS UNSIGNED) AS value_of_interest
FROM
  stats
WHERE
  check_id = 78
  AND probe_id = 1
  AND (datetime_day >= '2014-03-28 15:58:00' AND datetime_day <= '2014-10-28 15:58:00')
  AND (internal = 0)
GROUP BY
  datetime_day, step_id
ORDER BY
  datetime_day, step_id

What indexes do I need to set in the table definition and/or how do I need to modify my query in order for this to execute with a reasonable query execution plan?

Environment Specs:

Fedora release 19 (Schrödinger's Cat)
mysql Ver 15.1 Distrib 5.5.34-MariaDB, for Linux (x86_64) using readline 5.1
6G RAM, 30M Rows

Thanks a lot for your help!

PS: First time poster, sorry for any violations of best practices. I'm happy to learn...

EDIT:

One of the answers suggests adding this index:

ALTER TABLE `stats` ADD INDEX newindex (check_id, probe_id, internal, datetime_day, step_id);

which improves the situation a bit. I already tried this index before and got the following result:

+------+-------------+---------------------------+-------+---------------+----------+---------+------+--------+------------------------------------+
| id   | select_type | table                     | type  | possible_keys | key      | key_len | ref  | rows   | Extra                              |
+------+-------------+---------------------------+-------+---------------+----------+---------+------+--------+------------------------------------+
|    1 | SIMPLE      | stats                     | range | newindex      | newindex | 17      | NULL | 605682 | Using index condition; Using where |
+------+-------------+---------------------------+-------+---------------+----------+---------+------+--------+------------------------------------+

But shouldn't there be a way to have the query executed with a 'Loose / Tight Index Scan', as mentioned in link? I can't seem to get it to work, though, and I'm not sure I understand the mentioned article correctly.

Answer 1

You have 600K rows to scan, so it cannot run instantly.

Why do you need CAST(AVG(value_of_interest) AS UNSIGNED)? Can it be avoided, perhaps by cleansing the data before inserting?

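This index would make it "Using index" (a covering index, so the query is answered from the index alone without touching the table rows), which would make it faster. But if this is not your only query, it seems silly to add an index just for it.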

INDEX newindex (check_id, probe_id, internal, datetime_day, step_id, value_of_interest)
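A minimal sketch of how this could be applied and checked, assuming the table and query from the question. The index name covering_idx is hypothetical, chosen here so it does not clash with the newindex already tried in the question. Whether the optimizer actually picks the index depends on the data and the server version, so the EXPLAIN output should be checked rather than assumed.

```sql
-- Hypothetical index name; the column list is the one suggested above.
-- Equality columns come first, then the GROUP BY columns, then the
-- aggregated column so the query can be served from the index alone.
ALTER TABLE stats
  ADD INDEX covering_idx
    (check_id, probe_id, internal, datetime_day, step_id, value_of_interest);

-- Re-check the plan; the hope is to see "Using index" in the Extra column.
EXPLAIN
SELECT
  datetime_day,
  step_id,
  CAST(AVG(value_of_interest) AS UNSIGNED) AS value_of_interest
FROM stats
WHERE check_id = 78
  AND probe_id = 1
  AND internal = 0
  AND datetime_day >= '2014-03-28 15:58:00'
  AND datetime_day <= '2014-10-28 15:58:00'
GROUP BY datetime_day, step_id
ORDER BY datetime_day, step_id;
```

Even with a covering index, the roughly 600K matching index entries still have to be read, which is why the summary-table approach is the more scalable option.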

Is there some reason for the odd start/end time? (15:58:00)

The 'real' solution for summarizing a Data Warehouse table is to build and maintain "Summary table(s)". For the query in question, such a table would have check_id, probe_id, internal, step_id, datetime_hour, SUM(value_of_interest), COUNT(*). The first 5 would be the PRIMARY KEY. You would add rows to the table each hour. The report (for hours, days, weeks, months) would get the AVG by doing SUM(sums)/SUM(counts).

More discussion in my Summary Table blog.
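The summary-table idea could be sketched roughly as follows. The names stats_hourly, sum_value and cnt are assumptions made for illustration, as is the hourly load statement; none of them come from the question. One deliberate deviation from the description above: the sketch counts non-NULL values with COUNT(value_of_interest) instead of COUNT(*), because value_of_interest is nullable and AVG() skips NULLs.

```sql
-- Hypothetical summary table: one row per (check, probe, internal, step, hour).
CREATE TABLE stats_hourly (
  check_id      INT NOT NULL,
  probe_id      INT NOT NULL,
  internal      TINYINT(1) NOT NULL,
  step_id       INT NOT NULL,
  datetime_hour DATETIME NOT NULL,
  sum_value     BIGINT NOT NULL,        -- SUM(value_of_interest) for the hour
  cnt           INT UNSIGNED NOT NULL,  -- number of non-NULL values in the hour
  PRIMARY KEY (check_id, probe_id, internal, step_id, datetime_hour)
) ENGINE=InnoDB;

-- Load (or reload) one finished hour; the literal hour is only an example.
INSERT INTO stats_hourly
  (check_id, probe_id, internal, step_id, datetime_hour, sum_value, cnt)
SELECT
  check_id, probe_id, internal, step_id, datetime_hour,
  COALESCE(SUM(value_of_interest), 0),  -- SUM is NULL if every value is NULL
  COUNT(value_of_interest)
FROM stats
WHERE datetime_hour = '2014-10-28 15:00:00'
GROUP BY check_id, probe_id, internal, step_id, datetime_hour
ON DUPLICATE KEY UPDATE
  sum_value = VALUES(sum_value),
  cnt       = VALUES(cnt);

-- Daily report: the average is rebuilt as SUM(sums) / SUM(counts).
SELECT
  DATE(datetime_hour) AS datetime_day,
  step_id,
  CAST(SUM(sum_value) / SUM(cnt) AS UNSIGNED) AS value_of_interest
FROM stats_hourly
WHERE check_id = 78
  AND probe_id = 1
  AND internal = 0
  AND datetime_hour >= '2014-03-28 15:58:00'
  AND datetime_hour <= '2014-10-28 15:58:00'
GROUP BY DATE(datetime_hour), step_id
ORDER BY datetime_day, step_id;
```

The report then reads at most 24 summary rows per step and day instead of every raw row. Since the question's table is already maintained by triggers, the same triggers could presumably keep the summary rows up to date instead of a periodic load, but that is a design choice this sketch does not settle.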

Answer 2

ORDER BY clauses are notorious for slowing queries down. That said, indexes that better match your criteria and grouping clauses will help.

I would suggest a composite index (on multiple fields) as

( check_id, probe_id, internal, datetime_day, step_id )

This way, your WHERE clause is optimized by the leading columns, and the last two columns match the GROUP BY and ORDER BY clauses, so those are optimized as well.

solution1
1 ACCEPTED 2015-05-06 21:47:59

solution2
0 2015-05-06 13:05:54
