Grouping by year and month with MySQL while leveraging indexes and avoiding temporary/filesort
I have a large dataset (which is going to keep on growing!) where the data being read in bulk is stored with a DATE column, as every row in the core data tables belongs to a specific day (the context is analytics/reporting).
A lot of the views require data at a per-month rather than per-day level of detail, and I'm aggregating the data as needed via SQL (SUM, AVG, etc.).
This also means I'm grouping data by YEAR() and MONTH(), which cannot use the index on the DATE column and results in "Using temporary" and "Using filesort" in the query plan.
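For illustration, the kind of query described above might look like the following sketch. The table daily_stats and its columns stat_date and amount are hypothetical names, not taken from the original schema.

```sql
-- Hypothetical fact table: daily_stats(stat_date DATE, amount DECIMAL, ...)
-- with an ordinary index on stat_date.
EXPLAIN
SELECT YEAR(stat_date)  AS yr,
       MONTH(stat_date) AS mon,
       SUM(amount)      AS total
FROM daily_stats
-- Grouping on function results: the index on stat_date stores raw dates,
-- so it cannot deliver rows already grouped by (year, month).
GROUP BY YEAR(stat_date), MONTH(stat_date);
```

Because the grouping keys are computed per row, the Extra column of the EXPLAIN output would typically show the temporary table and filesort mentioned above.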
Is the best solution here to split the DATE column into 3 separate columns for year, month and day? Or to retain the DATE column (constraint, sorting, etc.) and add a "yearmonth" (yyyymm) column which is also indexed? I don't like duplicating data, but I'm just not 100% sure what the best practice is for this scenario.
I think the best way in terms of performance when GROUP-ing and SELECT-ing by month and year is to add a MONTH and a YEAR column to the data. The speed you gain from proper index usage will outweigh the cost of some extra/duplicated data.
Note that there is a YEAR datatype in MySQL.
Make sure to use B-TREE indices on the month and year columns (not HASH).
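A minimal sketch of this suggestion, assuming a hypothetical table daily_stats with a DATE column stat_date and a numeric column amount (none of these names come from the original post):

```sql
-- Add the redundant year/month columns (nullable so existing rows are accepted).
ALTER TABLE daily_stats
  ADD COLUMN stat_year  YEAR NULL,
  ADD COLUMN stat_month TINYINT UNSIGNED NULL;

-- Backfill the new columns from the existing DATE column.
UPDATE daily_stats
SET stat_year  = YEAR(stat_date),
    stat_month = MONTH(stat_date);

-- Composite B-tree index in the same column order as the GROUP BY.
ALTER TABLE daily_stats
  ADD INDEX idx_year_month (stat_year, stat_month) USING BTREE;

-- The grouping now uses plain indexed columns instead of function calls.
SELECT stat_year, stat_month, SUM(amount) AS total
FROM daily_stats
GROUP BY stat_year, stat_month;
```

New rows would also need the two columns filled on insert. Whether the optimizer actually picks the index for a given query is worth confirming with EXPLAIN.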
Do not split a DATE into component parts. The difficulties outweigh the presumed benefit.
Use Summary Tables to avoid lengthy analytics/reporting queries. See my blog on such. Roughly speaking, every night you would calculate some subtotals and counts for the past day, and put these in a "Summary Table". Analytics would run much faster against that table than against the "Fact" table.
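As a rough sketch of the nightly step, assuming a hypothetical fact table fact_table(stat_date, category_id, amount) and a hypothetical summary table daily_summary (the names and the category_id dimension are illustrative only):

```sql
-- One row per day and category, holding subtotals and counts.
CREATE TABLE daily_summary (
  stat_date   DATE          NOT NULL,
  category_id INT UNSIGNED  NOT NULL,
  row_count   INT UNSIGNED  NOT NULL,
  amount_sum  DECIMAL(14,2) NOT NULL,
  PRIMARY KEY (stat_date, category_id)
);

-- Nightly job: summarize yesterday's rows from the fact table.
-- Assumes the job runs once per day; a second run for the same day
-- would fail on the primary key instead of double counting.
INSERT INTO daily_summary (stat_date, category_id, row_count, amount_sum)
SELECT stat_date, category_id, COUNT(*), SUM(amount)
FROM fact_table
WHERE stat_date = CURDATE() - INTERVAL 1 DAY
GROUP BY stat_date, category_id;
```

The WHERE clause is a plain equality on the DATE column, so an index on stat_date in the fact table can be used for the nightly scan.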
For AVG, be sure to store SUM() and COUNT(*), then compute (in the Report) SUM(sums) / SUM(counts) as Average.
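A monthly report along these lines could be sketched as follows, assuming a hypothetical summary table daily_summary with columns stat_date, amount_sum (the stored SUM) and row_count (the stored COUNT(*)):

```sql
-- Monthly totals and averages computed from the daily subtotals.
SELECT YEAR(stat_date)  AS yr,
       MONTH(stat_date) AS mon,
       SUM(amount_sum)                  AS total,
       SUM(amount_sum) / SUM(row_count) AS average
FROM daily_summary
GROUP BY yr, mon;
```

Storing the sum and the count rather than a daily average matters because days with different row counts must be weighted. For example, if one day has 1 row with amount 10 and another has 3 rows totalling 6, the average of the two daily averages is (10 + 2) / 2 = 6, while the true average is 16 / 4 = 4.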