
Grouping by year and month with MySQL while leveraging indexes and avoiding temporary/filesort

Original 2015-04-16 12:01:14 2 2 mysql/ date/ query-optimization/ aggregate

I have a large dataset (which is going to keep on growing!) where the data being read in bulk is stored with a DATE column, since every row in the core data tables belongs to a specific day (the context is analytics/reporting).

A lot of the views need data at a per-month rather than per-day level of detail, and I'm aggregating the data as needed via SQL (SUM, AVG, etc.).

This also means I'm grouping data by YEAR() and MONTH(), which cannot use the index on the DATE column and results in Using temporary and Using filesort from the query executor.

Is the best solution here to split the DATE column into 3 separate columns for year, month and day? Or to retain the DATE column (constraint, sorting, etc.) and add a "yearmonth" (yyyymm) column which is also indexed? I don't like duplicating data, but I'm not 100% sure what the best practice is for this scenario.

2 answers

I think the best way, in terms of performance, to GROUP and SELECT by month and year is to add a MONTH and a YEAR column to the data. The speed you gain from proper index usage will outweigh the pain of some extra, duplicated data.

Note that there is a YEAR data type in MySQL.

Make sure to use B-TREE indexes on the month and year columns (not HASH).
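As a rough illustration of this answer's suggestion, the sketch below keeps the date and adds separate year and month columns with one composite index over them. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the table and column names (`facts`, `day`, `yr`, `mon`, `amount`) are made up for the example, and whether MySQL actually avoids the temporary table and filesort should be checked with EXPLAIN on the real schema.

```python
import sqlite3

# SQLite stands in for MySQL here; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (day TEXT NOT NULL, yr INTEGER NOT NULL, "
    "mon INTEGER NOT NULL, amount REAL NOT NULL)"
)
# Composite index whose column order matches the GROUP BY below.
conn.execute("CREATE INDEX idx_facts_yr_mon ON facts (yr, mon)")

rows = [("2015-03-30", 10.0), ("2015-03-31", 20.0), ("2015-04-01", 5.0)]
for day, amount in rows:
    # Year and month are derived from the date once, at insert time.
    yr, mon = int(day[:4]), int(day[5:7])
    conn.execute("INSERT INTO facts VALUES (?, ?, ?, ?)", (day, yr, mon, amount))

# Grouping on the plain columns instead of on YEAR(day), MONTH(day).
monthly = conn.execute(
    "SELECT yr, mon, SUM(amount) FROM facts GROUP BY yr, mon ORDER BY yr, mon"
).fetchall()
print(monthly)  # [(2015, 3, 30.0), (2015, 4, 5.0)]
```

The cost is the one the question worries about: the year and month duplicate information already in the date, so they have to be kept in sync on every insert or update.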

Do not split a DATE into component parts. The difficulties outweigh the presumed benefit.

Use Summary Tables to avoid lengthy analytics/reporting queries. See my blog on such. Roughly speaking, every night you would calculate some subtotals and counts for the past day and put them in a "Summary Table". Analytics would run much faster against that table than against the "Fact" table.
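A minimal sketch of such a nightly rollup might look like the following. It again uses Python's sqlite3 as a stand-in for MySQL, and the names `fact`, `daily_summary`, `amount_sum`, `row_count` and `summarize_day` are invented for the example; the blog mentioned above may structure things differently.

```python
import sqlite3

# SQLite stands in for MySQL; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact (day TEXT NOT NULL, amount REAL NOT NULL);
CREATE TABLE daily_summary (
    day TEXT PRIMARY KEY,
    amount_sum REAL NOT NULL,
    row_count INTEGER NOT NULL
);
INSERT INTO fact VALUES
    ('2015-04-14', 10.0), ('2015-04-14', 30.0), ('2015-04-15', 5.0);
""")

def summarize_day(conn, day):
    """Nightly job: roll up one day of fact rows into a single summary row."""
    # INSERT OR REPLACE is SQLite syntax; in MySQL the rough equivalents are
    # REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE.
    conn.execute(
        "INSERT OR REPLACE INTO daily_summary "
        "SELECT day, SUM(amount), COUNT(*) FROM fact WHERE day = ? GROUP BY day",
        (day,),
    )

for day in ("2015-04-14", "2015-04-15"):
    summarize_day(conn, day)

summary = conn.execute(
    "SELECT day, amount_sum, row_count FROM daily_summary ORDER BY day"
).fetchall()
print(summary)  # [('2015-04-14', 40.0, 2), ('2015-04-15', 5.0, 1)]
```

Monthly reports then read one row per day from the summary table instead of scanning every fact row, and re-running the job for a day simply overwrites that day's row.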

For AVG, be sure to store SUM() and COUNT(*), then compute (in the Report) SUM(sums) / SUM(counts) AS Average.
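The reason for storing both values is that an average of daily averages weights every day equally, no matter how many rows it had. A small sketch of the difference, using Python's sqlite3 as a stand-in for MySQL and a hypothetical `daily_summary` table with `amount_sum` and `row_count` columns:

```python
import sqlite3

# SQLite stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_summary (day TEXT PRIMARY KEY, amount_sum REAL, row_count INTEGER);
INSERT INTO daily_summary VALUES
    ('2015-04-14', 40.0, 2),   -- two rows, daily average 20.0
    ('2015-04-15', 5.0, 1);    -- one row, daily average 5.0
""")

# Correct: total of the sums divided by total of the counts.
(true_avg,) = conn.execute(
    "SELECT SUM(amount_sum) / SUM(row_count) FROM daily_summary"
).fetchone()

# Misleading: averaging the per-day averages ignores each day's row count.
(avg_of_avgs,) = conn.execute(
    "SELECT AVG(amount_sum / row_count) FROM daily_summary"
).fetchone()

print(true_avg, avg_of_avgs)  # 15.0 12.5
```

Here the three underlying values (10, 30 and 5) average to 15.0, which only the SUM(sums) / SUM(counts) form reproduces.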