I have a large dataset (which is going to keep on growing!) where the data being read in bulk is stored with a DATE column, as all rows in any of the core data tables belong to a specific day (the context is analytics/reporting).
A lot of the views require data at a per-month rather than per-day level of detail, and I'm aggregating the data as needed via SQL (SUM, AVG, etc.).
This also means I'm grouping data by YEAR() and MONTH(), which cannot use the index on the DATE column and results in "Using temporary" and "Using filesort" from the query executor.
Is the best solution here to split the DATE column into 3 separate columns for year, month and day? Or to retain the DATE column (for constraints, sorting, etc.) and add a "yearmonth" (yyyymm) column which is also indexed? I don't like duplicating data, but I'm just not 100% sure what the best practice is for this scenario.
I think the best way in terms of performance when GROUPing and SELECTing by month and year is to add MONTH and YEAR columns to the data. The speed you gain from proper index usage will outweigh the pain of storing some extra, duplicated data.
Note that there is a YEAR datatype in MySQL.
Make sure to use B-TREE indexes on the month and year columns (not HASH).
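As a rough sketch of this suggestion (not a definitive schema), the extra columns and a composite index could look like the following. The table name `daily_stats` and the columns `stat_date`, `amount`, `stat_year` and `stat_month` are assumed names for illustration; the question does not give the real ones.

```sql
-- Hypothetical fact table: daily_stats(stat_date DATE, amount DECIMAL, ...).
-- Add the derived columns; they are nullable here so existing rows are accepted.
ALTER TABLE daily_stats
  ADD COLUMN stat_year  YEAR NULL,
  ADD COLUMN stat_month TINYINT UNSIGNED NULL;

-- Backfill the derived columns from the existing DATE column.
UPDATE daily_stats
SET stat_year  = YEAR(stat_date),
    stat_month = MONTH(stat_date);

-- Composite B-tree index covering the grouping columns.
ALTER TABLE daily_stats
  ADD INDEX idx_year_month (stat_year, stat_month) USING BTREE;

-- Monthly aggregation grouped on plain columns instead of YEAR()/MONTH() calls.
SELECT stat_year, stat_month, SUM(amount) AS total, AVG(amount) AS average
FROM daily_stats
GROUP BY stat_year, stat_month;
```

With this layout, whatever inserts new rows also has to fill in `stat_year` and `stat_month`, which is the duplication cost mentioned above. Checking the query with EXPLAIN would show whether the index is actually picked for the grouping.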
Do not split a DATE into component parts. The difficulties outweigh the presumed benefit.
Use Summary Tables to avoid lengthy analytics/reporting. See my blog on such. Roughly speaking, every night you would calculate some subtotals and counts for the past day, and put these in a "Summary Table". Analytics would run much faster against that table than against the "Fact" table.
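For AVG, be sure to store SUM() and COUNT(*), then compute (in the report) SUM(sums) / SUM(counts) as Average. Averaging the stored daily averages instead would weight a day with few rows the same as a day with many.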
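A minimal sketch of such a summary table, assuming a fact table called `daily_stats` with a `stat_date` DATE column and a numeric `amount` column (all names here are hypothetical, as are the date literals in the report query):

```sql
-- One row per day, holding the subtotal and the row count for that day.
CREATE TABLE daily_summary (
  stat_date  DATE          NOT NULL,
  sum_amount DECIMAL(20,2) NOT NULL,
  row_count  INT UNSIGNED  NOT NULL,
  PRIMARY KEY (stat_date)
);

-- Nightly job: summarise yesterday's rows from the fact table.
INSERT INTO daily_summary (stat_date, sum_amount, row_count)
SELECT stat_date, SUM(amount), COUNT(*)
FROM daily_stats
WHERE stat_date = CURDATE() - INTERVAL 1 DAY
GROUP BY stat_date;

-- Monthly report: the average is rebuilt from sums and counts,
-- not from an average of daily averages.
SELECT YEAR(stat_date)  AS yr,
       MONTH(stat_date) AS mon,
       SUM(sum_amount)                  AS total,
       SUM(sum_amount) / SUM(row_count) AS average
FROM daily_summary
WHERE stat_date >= '2024-01-01' AND stat_date < '2025-01-01'
GROUP BY yr, mon;
```

The report still groups by YEAR() and MONTH(), but it does so over one row per day rather than over the whole fact table, and the range condition on `stat_date` can use the primary key.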