
View or Stored Procedure for an Aggregate Query?

  • I currently have a table with 600,000,000 rows.
  • I want to reduce the number of rows, for my reporting application, by performing a daily average on the data with a GROUP BY clause.

The smaller subset of data (a 99% reduction) will then be used by my reporting application.

As this will be 'built' on a daily basis, what is the best tool: a Stored Procedure, a View, or something else?

Build and maintain a Summary table. Initially you would need to run a big GROUP BY to collect all the old data. After that, a nightly job would compute COUNT(*), SUM(...), etc. for the previous day.

Then the 'report' would run much faster against this new table.

The key for that table would include the day (not date+time), plus a few columns that you may need for the report(s).
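
A minimal sketch of such a summary table, assuming hypothetical names: `dy` for the day, `sensor_id` as an example reporting dimension, and `ct`/`tot` holding the count and sum needed to rebuild averages later:

```sql
-- Hypothetical summary table; adjust columns and key order to match
-- the dimensions your reports actually group by.
CREATE TABLE SummaryTable (
  dy        DATE NOT NULL,          -- day only, not date+time
  sensor_id INT  NOT NULL,          -- example column the reports need
  ct        INT UNSIGNED NOT NULL,  -- COUNT(*) for that day
  tot       DOUBLE NOT NULL,        -- SUM(val) for that day
  PRIMARY KEY (sensor_id, dy)       -- or (dy, sensor_id), per query patterns
);
```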

Blog with more details.

I find that the typical speedup is 10x; you might get 100x (a 99% reduction).

The best tool is a script that you run via cron (or perhaps a MySQL EVENT). It would simply do something like:

INSERT INTO SummaryTable (dy, ..., ct, tot, ...)
SELECT DATE(datetime), ...,   -- key
       COUNT(*), SUM(...), ...   -- data
   FROM FactTable
   WHERE datetime >= CURDATE() - INTERVAL 1 DAY
     AND datetime  < CURDATE();

That one SQL statement may be all that is needed. Yes, it could be in a Stored Procedure, but that is not much different from having it directly in the nightly script.
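
If you prefer the EVENT route, a hedged sketch might look like the following (hypothetical names; assumes `event_scheduler` is ON and the summary table from earlier). A cron job running the same statement through the mysql client works just as well:

```sql
-- Hypothetical nightly EVENT; requires SET GLOBAL event_scheduler = ON.
CREATE EVENT nightly_summary
  ON SCHEDULE EVERY 1 DAY
  STARTS CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 1 HOUR  -- ~01:00 each night
  DO
    INSERT INTO SummaryTable (dy, ct, tot)
    SELECT DATE(datetime), COUNT(*), SUM(val)
      FROM FactTable
      WHERE datetime >= CURDATE() - INTERVAL 1 DAY
        AND datetime  < CURDATE()
      GROUP BY DATE(datetime);
```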

In some cases it may be better to use INSERT ... ON DUPLICATE KEY UPDATE ... SELECT ... (but that gets messy).
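
A sketch of that IODKU form, under the same hypothetical schema. Its advantage is that a rerun for a day already present updates the row in place instead of hitting a duplicate-key error. (`VALUES()` is the long-standing syntax; MySQL 8.0.20+ prefers a row alias instead.)

```sql
-- Rerun-safe variant: replace (don't add to) the day's totals, since
-- the SELECT re-reads the whole day each time.
INSERT INTO SummaryTable (dy, ct, tot)
SELECT DATE(datetime), COUNT(*), SUM(val)
  FROM FactTable
  WHERE datetime >= CURDATE() - INTERVAL 1 DAY
    AND datetime  < CURDATE()
  GROUP BY DATE(datetime)
ON DUPLICATE KEY UPDATE
  ct  = VALUES(ct),
  tot = VALUES(tot);
```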

When talking about "averages", consider the following:

  • A daily average can be computed each night: AVG(...), but
  • A monthly average should probably be computed, not from the daily averages, but from SUM(daily_sums) / SUM(daily_counts). That is, the summary table probably needs COUNT(*) and SUM(...).
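
The monthly rollup from the daily summary can be sketched like this (hypothetical names). Note that AVG(daily averages) would weight every day equally no matter how many rows each day had; SUM/SUM weights every row correctly:

```sql
-- Correctly weighted monthly average, derived from the daily summary.
SELECT DATE_FORMAT(dy, '%Y-%m') AS mo,
       SUM(tot) / SUM(ct)       AS monthly_avg
  FROM SummaryTable
  GROUP BY mo;
```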

To initially populate this summary table, I would write a one-time script that slowly walks through the 600M rows, one day at a time. Sure, you could do it all at once, but the interference with everything else might be 'bad'.
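
One way to sketch that one-time walk is a throwaway stored procedure that loops over days (hypothetical names and date range; each iteration touches only one day's rows, so other traffic is barely disturbed):

```sql
DELIMITER //
CREATE PROCEDURE backfill_summary(IN start_day DATE, IN end_day DATE)
BEGIN
  DECLARE d DATE DEFAULT start_day;
  WHILE d < end_day DO
    -- Summarize exactly one day per iteration; an empty day inserts nothing.
    INSERT INTO SummaryTable (dy, ct, tot)
      SELECT DATE(datetime), COUNT(*), SUM(val)
        FROM FactTable
        WHERE datetime >= d
          AND datetime  < d + INTERVAL 1 DAY
        GROUP BY DATE(datetime);
    SET d = d + INTERVAL 1 DAY;
  END WHILE;
END //
DELIMITER ;
```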

Even better would be for the nightly script to include code to "pick up where it left off". That way, if the script fails to run some night, it will repair the omission the next night. Or you can run it manually when you see a problem. And an extra run won't hurt anything.
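
One way to get that behavior, sketched with the same hypothetical names: instead of hard-coding "yesterday", summarize every complete day after the newest day already in the summary table. A rerun finds nothing new and inserts nothing:

```sql
-- Catch-up variant: fills in any missed days, and an extra run is a no-op.
INSERT INTO SummaryTable (dy, ct, tot)
SELECT DATE(datetime), COUNT(*), SUM(val)
  FROM FactTable
  WHERE datetime >= (SELECT MAX(dy) FROM SummaryTable) + INTERVAL 1 DAY
    AND datetime  < CURDATE()   -- stop before today, which is incomplete
  GROUP BY DATE(datetime);
```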

While you are at it, think about other Summary Tables you might need. I typically find that a Data Warehouse application needs 3-7 summary tables. On the other hand, keep in mind that weekly and monthly summaries can be derived (efficiently enough) from a daily summary table. In a couple of cases, I had an hourly summary table for one thing, then daily tables for different things.

600M rows is big. Will 'old' data be purged? Once you have the summary tables you need, will the 'old' data no longer be needed? Blog on using Partitioning for such.
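
A hedged sketch of that partitioning idea on the hypothetical fact table: RANGE-partition by day so that purging a month is a fast DROP PARTITION instead of a slow DELETE. (Caveat: in MySQL, the partitioning column must be part of every unique key on the table.)

```sql
-- Example monthly partitions; in practice a script would add new
-- partitions ahead of time and drop the oldest as data ages out.
ALTER TABLE FactTable
  PARTITION BY RANGE (TO_DAYS(datetime)) (
    PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
  );

-- Purging a month is then nearly instantaneous:
ALTER TABLE FactTable DROP PARTITION p2024_01;
```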
