
View or Stored Procedure for an Aggregate Query?

  • I currently have a table with 600,000,000 rows.
  • I want to reduce the number of rows for my reporting application by computing a daily average of the data with a GROUP BY clause.

The smaller subset of data (a 99% reduction) will then be used by my reporting application.

As this will be 'built' on a daily basis, what is the best tool - a Stored Procedure, a View, or something else?

Build and maintain a Summary table. Initially you would need to run a big GROUP BY to collect all the old data. After that, a nightly job would compute COUNT(*), SUM(...), etc. for the previous day.

Then the 'report' would run much faster against this new table.

The key for that table would include the day (not date+time), plus a few columns that you will need for the report(s).
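For concreteness, a minimal sketch of such a summary table. The dimensions region and product_id (and the amount being summed later on) are assumptions for illustration; substitute your actual report columns.

CREATE TABLE SummaryTable (
    dy         DATE NOT NULL,          -- the day being summarized (no time part)
    region     VARCHAR(20) NOT NULL,   -- assumed report dimension
    product_id INT NOT NULL,           -- assumed report dimension
    ct         INT UNSIGNED NOT NULL,  -- COUNT(*) for that day/group
    tot        DECIMAL(14,2) NOT NULL, -- SUM(...) for that day/group
    PRIMARY KEY (dy, region, product_id)
) ENGINE=InnoDB;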

Blog with more details.

I find that the typical speedup is 10x; you might get 100x (99% reduction).

The best tool is a script that you run via cron (or perhaps a MySQL EVENT). It would simply do something like

INSERT INTO SummaryTable (dy, ..., ct, tot, ...)
SELECT DATE(datetime), ...,      -- key
       COUNT(*), SUM(...), ...   -- data
   FROM FactTable
   WHERE datetime >= CURDATE() - INTERVAL 1 DAY
     AND datetime <  CURDATE()
   GROUP BY DATE(datetime), ...;   -- group on the same key columns

That one SQL statement may be all that is needed. Yes, it could be in a Stored Procedure, but that is not much different than having it directly in the nightly script.
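If you prefer the MySQL EVENT route over cron, a hedged sketch (it requires the event scheduler to be enabled; region, product_id and amount are the same assumed placeholder names used above):

CREATE EVENT nightly_summary
    ON SCHEDULE EVERY 1 DAY
    STARTS CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 1 HOUR   -- roughly 01:00 tomorrow
DO
    INSERT INTO SummaryTable (dy, region, product_id, ct, tot)
    SELECT DATE(datetime), region, product_id, COUNT(*), SUM(amount)
      FROM FactTable
     WHERE datetime >= CURDATE() - INTERVAL 1 DAY
       AND datetime <  CURDATE()
     GROUP BY DATE(datetime), region, product_id;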

In some cases it may be better to use INSERT ... ON DUPLICATE KEY UPDATE ... SELECT ... (but that gets messy).
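A sketch of that variant, with the same assumed column names. Re-running it for a day that is already summarized overwrites the existing rows instead of failing on the duplicate key:

INSERT INTO SummaryTable (dy, region, product_id, ct, tot)
SELECT DATE(datetime), region, product_id, COUNT(*), SUM(amount)
  FROM FactTable
 WHERE datetime >= CURDATE() - INTERVAL 1 DAY
   AND datetime <  CURDATE()
 GROUP BY DATE(datetime), region, product_id
ON DUPLICATE KEY UPDATE
    ct  = VALUES(ct),
    tot = VALUES(tot);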

When talking about "averages", consider the following:

  • A daily average can be computed each night: AVG(...), but
  • A monthly average should probably be computed, not from the daily averages, but as SUM(daily_sums) / SUM(daily_counts). That is, the summary table probably needs COUNT(*) and SUM(...); a sketch follows this list.
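The monthly calculation against the daily summary table, with the assumed column names from above; weighting by the daily counts keeps heavy days from being diluted by light ones:

SELECT DATE_FORMAT(dy, '%Y-%m') AS yr_month,
       SUM(tot) / SUM(ct)       AS monthly_avg
  FROM SummaryTable
 GROUP BY DATE_FORMAT(dy, '%Y-%m');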

To initially populate this summary table, I would write a one-time script to slowly walk through the 600M rows one day at a time. Sure, you could do it all at once, but the interference with everything else might be 'bad'.
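One way to do that walk, as a hedged sketch: a throwaway stored procedure that summarizes one day per INSERT, so each pass touches only a small slice of the 600M rows (same assumed names as above):

DELIMITER //
CREATE PROCEDURE backfill_summary(IN start_day DATE, IN end_day DATE)
BEGIN
    DECLARE d DATE DEFAULT start_day;
    WHILE d < end_day DO
        INSERT INTO SummaryTable (dy, region, product_id, ct, tot)
        SELECT DATE(datetime), region, product_id, COUNT(*), SUM(amount)
          FROM FactTable
         WHERE datetime >= d
           AND datetime <  d + INTERVAL 1 DAY
         GROUP BY DATE(datetime), region, product_id;
        SET d = d + INTERVAL 1 DAY;
    END WHILE;
END //
DELIMITER ;

-- e.g.  CALL backfill_summary('2020-01-01', CURDATE());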

Even better would be for the nightly script to include code to "pick up where it left off". This way, if the script fails to run some night it will repair the omission the next night. Or you can manually run it when you see a problem. And an extra run won't hurt anything.
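One simple way to get "pick up where it left off" is to derive the starting point from the summary table itself instead of hard-coding "yesterday" (assumed names again; the fallback date is an arbitrary floor for an empty table):

SET @last_done := (SELECT COALESCE(MAX(dy), DATE('2000-01-01')) FROM SummaryTable);

INSERT INTO SummaryTable (dy, region, product_id, ct, tot)
SELECT DATE(datetime), region, product_id, COUNT(*), SUM(amount)
  FROM FactTable
 WHERE datetime >= @last_done + INTERVAL 1 DAY
   AND datetime <  CURDATE()          -- only complete days
 GROUP BY DATE(datetime), region, product_id;

If a night is missed, the next run covers two days; an extra run finds nothing new to insert.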

While you are at it, think about other Summary Tables you might need. I typically find that 3-7 summary tables are needed for a Data Warehouse application. On the other hand, keep in mind that weekly and monthly summaries can be derived (efficiently enough) from a daily summary table. In a couple of cases, I had an Hourly summary table for one thing, then daily tables for different things.

600M rows is big. Will 'old' data be purged? Once you have the summary tables you need, will the 'old' data no longer be needed? Blog on using Partitioning for such.
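If purging does become necessary, a hedged sketch of the usual approach: RANGE partitions on the day, so old data can be dropped a partition at a time. Note that every unique key on the table must then include the partitioning column, and repartitioning a 600M-row table is itself a large operation.

ALTER TABLE FactTable
    PARTITION BY RANGE (TO_DAYS(datetime)) (
        PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
        PARTITION p2024 VALUES LESS THAN (TO_DAYS('2025-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

-- Dropping a whole partition is nearly instantaneous, unlike a huge DELETE:
ALTER TABLE FactTable DROP PARTITION p2023;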
