How can I efficiently select the average of multiple sums computed at different timestamps in SQL?

Question

I have a database table that looks like this:

id | macaddr | load | timestamp
=========================================
 1 | 0011111 |   17 | 2012-02-07 10:00:00
 2 | 0011111 |    6 | 2012-02-07 12:00:00
 3 | 0022222 |    3 | 2012-02-07 12:00:03
 4 | 0033333 |    9 | 2012-02-07 12:00:04
 5 | 0022222 |    4 | 2012-02-07 12:00:06
 6 | 0033333 |    8 | 2012-02-07 12:00:10
...

Now I would like to calculate the average load over all devices (= MAC addresses) for different time periods, e.g., today, yesterday, this week, this month.

An average load can be calculated by first finding the overall load sum at different points in time (sample dates) and then averaging the load sums for these sample dates. For example, if I wanted the average load of the last ten seconds (and now is 2012-02-07 12:00:10), I could choose my sample dates to be 12:00:02, 12:00:04, 12:00:06, 12:00:08, and 12:00:10. Then I would calculate the load sums by summing up the most recent load value of each device:

2012-02-07 12:00:02 |  6  [= load(id=2)]
2012-02-07 12:00:04 | 18  [= load(id=2) + load(id=3) + load(id=4)]
2012-02-07 12:00:06 | 19  [= load(id=2) + load(id=4) + load(id=5)]
2012-02-07 12:00:08 | 19  [= load(id=2) + load(id=4) + load(id=5)]
2012-02-07 12:00:10 | 18  [= load(id=2) + load(id=5) + load(id=6)]

A device's load value is ignored if it is older than, e.g., an hour (as happened here to id=1). The average would be 16 in this case.
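The sampling rule above can be checked with a small sketch. It is written in Python rather than SQL so it runs without a database. The names load_sum_at and average_load are made up for illustration, and the sketch assumes that a reading taken exactly at the sample time counts (as the 12:00:04 line in the example implies) and that "older than an hour" means at least one hour old.

```python
from datetime import datetime, timedelta

# Rows from the question's table: (macaddr, load, timestamp).
ROWS = [
    ("0011111", 17, datetime(2012, 2, 7, 10, 0, 0)),
    ("0011111", 6, datetime(2012, 2, 7, 12, 0, 0)),
    ("0022222", 3, datetime(2012, 2, 7, 12, 0, 3)),
    ("0033333", 9, datetime(2012, 2, 7, 12, 0, 4)),
    ("0022222", 4, datetime(2012, 2, 7, 12, 0, 6)),
    ("0033333", 8, datetime(2012, 2, 7, 12, 0, 10)),
]


def load_sum_at(rows, sample, max_age=timedelta(hours=1)):
    """Sum the most recent load of each device as of `sample`."""
    latest = {}  # macaddr -> (timestamp, load)
    for mac, load, ts in rows:
        # Assumption: skip readings after the sample time or at least max_age old.
        if ts > sample or ts <= sample - max_age:
            continue
        if mac not in latest or ts > latest[mac][0]:
            latest[mac] = (ts, load)
    return sum(load for _, load in latest.values())


def average_load(rows, samples):
    """Average the per-sample load sums."""
    sums = [load_sum_at(rows, s) for s in samples]
    return sum(sums) / len(sums)


# Sample dates 12:00:02, 12:00:04, ..., 12:00:10 from the example.
START = datetime(2012, 2, 7, 12, 0, 2)
SAMPLES = [START + timedelta(seconds=2 * i) for i in range(5)]
SUMS = [load_sum_at(ROWS, s) for s in SAMPLES]

print(SUMS)                         # [6, 18, 19, 19, 18]
print(average_load(ROWS, SAMPLES))  # 16.0
```

The per-sample sums match the table of load sums above, and their mean is the stated average of 16.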

Currently, I generate a rather complex query with many "UNION ALL" statements, which is really slow:

SELECT avg(l.load_sum) as avg_load
FROM (
    SELECT sum(so.load) AS load_sum 
    FROM (
        SELECT * 
        FROM (
            SELECT si.macaddr, si.load 
            FROM sensor_data si WHERE si.timestamp > '2012-02-07 11:00:10' AND si.timestamp < '2012-02-07 12:00:10'
            ORDER BY si.timestamp DESC 
        ) AS sm
        GROUP BY macaddr
    ) so
    UNION ALL
    [THE SAME THING AGAIN WITH OTHER TIMESTAMPS]
    UNION ALL
    [AND AGAIN]
    UNION ALL
    [AND AGAIN]
    ...
) l

Now imagine I would like to calculate the average load for a whole month. With hourly sample dates I need to combine 30 x 24 = 720 queries using UNION ALL. The overall query takes nearly a minute to complete on my machine. I am sure there is a much better solution without UNION ALL, but I did not find anything useful on the Web. I would therefore be very thankful for your help!

Answer 1

Use a fraction of the Unix timestamp. First we formulate the hourly (3600 seconds) averages:

SELECT
  macaddr,
  -- LOAD is a reserved word in MySQL, so the column name is quoted
  AVG(`load`) AS loadavg,
  FLOOR(UNIX_TIMESTAMP(`timestamp`)/3600) AS hourbase
FROM sensor_data
GROUP BY macaddr, FLOOR(UNIX_TIMESTAMP(`timestamp`)/3600)

Then we average those over the month:

SELECT
  macaddr,
  AVG(loadavg) AS monthlyavg
FROM (
    -- hourly average load per device
    SELECT
      macaddr,
      AVG(`load`) AS loadavg,
      FLOOR(UNIX_TIMESTAMP(`timestamp`)/3600) AS hourbase
    FROM sensor_data
    WHERE `timestamp` BETWEEN '2012-01-07 12:00:00' AND '2012-02-07 11:59:59'
    GROUP BY macaddr, FLOOR(UNIX_TIMESTAMP(`timestamp`)/3600)
) AS hourlies
-- group by device only, so each device's hourly averages collapse into one monthly value
GROUP BY macaddr
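To see what the two-step aggregation produces, here is a rough Python sketch of the same idea: bucket readings by device and by the hour number derived from the Unix timestamp, average within each bucket, then average the hourly values per device. The function name hourly_then_overall_avg is hypothetical, the sample rows are the ones from the question, and timestamps are assumed to be UTC (MySQL's UNIX_TIMESTAMP uses the session time zone instead).

```python
from collections import defaultdict
from datetime import datetime, timezone

# Rows from the question's table: (macaddr, load, timestamp).
ROWS = [
    ("0011111", 17, datetime(2012, 2, 7, 10, 0, 0)),
    ("0011111", 6, datetime(2012, 2, 7, 12, 0, 0)),
    ("0022222", 3, datetime(2012, 2, 7, 12, 0, 3)),
    ("0033333", 9, datetime(2012, 2, 7, 12, 0, 4)),
    ("0022222", 4, datetime(2012, 2, 7, 12, 0, 6)),
    ("0033333", 8, datetime(2012, 2, 7, 12, 0, 10)),
]


def hourly_then_overall_avg(rows):
    """Return {macaddr: average of that device's hourly average loads}."""
    buckets = defaultdict(list)  # (macaddr, hourbase) -> loads in that hour
    for mac, load, ts in rows:
        # Same idea as FLOOR(UNIX_TIMESTAMP(ts) / 3600); UTC is an assumption.
        hourbase = int(ts.replace(tzinfo=timezone.utc).timestamp()) // 3600
        buckets[(mac, hourbase)].append(load)

    hourly = defaultdict(list)  # macaddr -> list of hourly averages
    for (mac, _hourbase), loads in buckets.items():
        hourly[mac].append(sum(loads) / len(loads))

    return {mac: sum(avgs) / len(avgs) for mac, avgs in hourly.items()}


RESULT = hourly_then_overall_avg(ROWS)
print(RESULT)  # {'0011111': 11.5, '0022222': 3.5, '0033333': 8.5}
```

Note that this yields one average per device. It does not sum the loads across devices the way the question's sample-date scheme does.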

Answer 2

To make things easier for yourself, you should create an "hour" function that returns a datetime truncated to the hour, with everything after the hour part set to zero. So right now (2/7/2012 5:05pm) would be 2012-02-07 17:00. Here's the code for your hour function:

select dateadd(hh, DATEPART(hh, current_timestamp), DATEADD(dd, 0, datediff(dd, 0, current_timestamp)))

(Replace current_timestamp in the above code with the datetime parameter of your hour function. I'll assume you created it as dbo.fnHour(), and that it takes a datetime parameter.)

You can then use dbo.fnHour() as a partitioning function to query what you want. Your SQL will look something like this:

select 
    avg(f.[load]) as avg_load
from (
    -- [load] is bracketed because LOAD is a reserved keyword in T-SQL
    select dbo.fnHour(si.timestamp) [hour], macaddr, sum(si.[load]) as [load]
    from 
        sensor_data si 
    where 
        si.timestamp >= dateadd(mm, -1, current_timestamp)
    group by 
        dbo.fnHour(si.timestamp), macaddr
) as f

I haven't tested it, so there may be some typos, but this should be enough to get you going.
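For readers without a SQL Server instance at hand, the same shape of computation can be sketched in Python. Here fn_hour plays the role of the assumed dbo.fnHour(), avg_of_hourly_sums is a made-up name, the rows are the sample rows from the question, and the one-month filter from the query is left out for brevity.

```python
from collections import defaultdict
from datetime import datetime

# Rows from the question's table: (macaddr, load, timestamp).
ROWS = [
    ("0011111", 17, datetime(2012, 2, 7, 10, 0, 0)),
    ("0011111", 6, datetime(2012, 2, 7, 12, 0, 0)),
    ("0022222", 3, datetime(2012, 2, 7, 12, 0, 3)),
    ("0033333", 9, datetime(2012, 2, 7, 12, 0, 4)),
    ("0022222", 4, datetime(2012, 2, 7, 12, 0, 6)),
    ("0033333", 8, datetime(2012, 2, 7, 12, 0, 10)),
]


def fn_hour(ts):
    """Truncate a datetime to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def avg_of_hourly_sums(rows):
    """Sum load per (hour, macaddr) group, then average those sums."""
    sums = defaultdict(int)  # (hour, macaddr) -> summed load
    for mac, load, ts in rows:
        sums[(fn_hour(ts), mac)] += load
    return sum(sums.values()) / len(sums)


print(fn_hour(datetime(2012, 2, 7, 17, 5)))  # 2012-02-07 17:00:00
print(avg_of_hourly_sums(ROWS))              # 11.75
```

With the sample rows there are four (hour, device) groups with sums 17, 6, 7 and 17, so the result is 47 / 4 = 11.75.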

Answer 3

I may be misunderstanding what you are trying to do. It looks like the sampling makes things a lot more complicated than they need to be. Perhaps giving samples of what the result should look like would allow people to provide better solutions for your particular case.

mysql> SELECT * FROM `test`;
+----+-----+------+------------+
| id | mac | load | when       |
+----+-----+------+------------+
|  1 |   1 |   10 | 2012-02-01 |
|  2 |   1 |   20 | 2012-01-01 |
|  3 |   2 |   60 | 2011-09-01 |
+----+-----+------+------------+

mysql> SELECT avg(`sum_load`)
    -> FROM 
    -> (
    ->    SELECT sum( `load` ) as sum_load
    ->    FROM `test`
    ->    WHERE `when` > '2011-01-15'
    ->    GROUP BY `mac`
    -> ) as t1;
+-----------------+
| avg(`sum_load`) |
+-----------------+
|         45.0000 |
+-----------------+

mysql> SELECT avg(`sum_load`)
    -> FROM 
    -> (
    ->    SELECT sum( `load` ) as sum_load
    ->    FROM `test`
    ->    WHERE `when` > '2011-01-15' AND `when` < '2012-01-15'
    ->    GROUP BY `mac`
    -> ) as t1;
+-----------------+
| avg(`sum_load`) |
+-----------------+
|         40.0000 |
+-----------------+
