How to calculate a moving average in MySQL in a correlated subquery?

I want to create a timeline report that shows, for each date in the timeline, a moving average of the latest N data points in a data set of measures and the dates on which they were taken. I have a calendar table populated with every day to provide the dates. A correlated subquery gives me the overall average up to each date fairly simply (the real situation is much more complex than this, but it essentially reduces to this):

SELECT  c.date
,       (   SELECT  AVG(m.value)   -- MySQL's aggregate is AVG(), not AVERAGE()
            FROM    measures as m
            WHERE   m.measured_on_dt <= c.date
        ) as `average_to_date`
FROM    calendar c
WHERE   c.date between date1 AND date2  -- graph boundaries
ORDER BY c.date ASC

I've spent days reading around this and haven't found any good solutions. Some have suggested that LIMIT might work in the subquery (LIMIT is supported in subqueries in the current version of MySQL). However, LIMIT applies to the returned rows, not to the rows going into the aggregate, so adding it makes no difference.
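To illustrate why LIMIT does not help here, the following is a minimal sketch that uses SQLite through Python's `sqlite3` module as a stand-in for MySQL (an assumption made only so the example runs without a server). The helper name `avg_with_and_without_limit` and the sample values are hypothetical. An aggregate without GROUP BY produces a single row, and LIMIT is applied to that one row afterwards.

```python
import sqlite3


def avg_with_and_without_limit(values, limit):
    """Return (AVG over all rows, AVG with a LIMIT clause appended)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measures (measured_on_dt TEXT, value REAL)")
    # Hypothetical data: one measure per day in January 2020 (at most 31 values).
    conn.executemany(
        "INSERT INTO measures VALUES (?, ?)",
        [(f"2020-01-{i + 1:02d}", v) for i, v in enumerate(values)],
    )
    plain = conn.execute("SELECT AVG(value) FROM measures").fetchone()[0]
    # LIMIT trims the result set, which is already a single aggregated row.
    limited = conn.execute(
        "SELECT AVG(value) FROM measures LIMIT ?", (limit,)
    ).fetchone()[0]
    conn.close()
    return plain, limited


plain, limited = avg_with_and_without_limit(range(1, 11), 5)
print(plain, limited)  # both are the average of all ten values
```

Both numbers come out the same because the aggregate has already consumed every row before LIMIT is considered.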

Nor can I write a non-aggregated SELECT with a LIMIT and then aggregate over that, because a correlated subquery is not allowed inside a FROM clause. So this (sadly) WON'T work:

SELECT  c.date
,       (   SELECT  AVG(last_5.value)
            FROM (  SELECT  m.value
                    FROM    measures as m
                    WHERE   m.measured_on_dt <= c.date  -- outer reference inside a derived table: rejected
                    ORDER BY m.measured_on_dt DESC
                    LIMIT 5
                 ) as `last_5`
        ) as `average_of_last_5`
FROM    calendar c
WHERE   c.date between date1 AND date2  -- graph boundaries
ORDER BY c.date ASC

I'm thinking I need to avoid the subquery approach completely and see whether I can do this with a clever join / row-numbering technique using user variables, and then aggregate that. While I work on it, I thought I'd ask whether anyone knows a better method.

UPDATE: Okay, I've got a solution working, which I've simplified for this example. It relies on some user-variable trickery to number the measures backwards from each calendar date. It also does a cross product with the calendar table (instead of a subquery), but this has the unfortunate side effect of making the row-numbering trick fail (user variables are evaluated when the rows are sent to the client, not when each row is evaluated). To work around this, I've had to nest the query one level, order the results, and then apply the row-numbering trick to that ordered set, which then works.

This query only returns calendar dates for which there are measures, so if you want the whole timeline, select from the calendar and LEFT JOIN to this result set.

set @day = 0;
set @num = 0;
set @LIMIT = 5;

SELECT  full_date
,       AVG(value) as recent_N_AVG
FROM
(  SELECT *
  -- restart the counter at 1 whenever the calendar date changes
  ,      @num := if(@day = full_date, @num + 1, 1) as day_row_number
  ,      @day := full_date as dummy
  FROM 
  ( SELECT  c.full_date
    ,       m.value
    ,       m.measured_on_dt
    FROM    calendar c 
    JOIN    measures as m
    WHERE   m.measured_on_dt <= c.full_date
    AND     c.full_date BETWEEN date1 AND date2  
    ORDER BY c.full_date ASC, measured_on_dt DESC
  ) as full_data
) as numbered
WHERE day_row_number <= @LIMIT   -- keep only the N most recent measures per date
GROUP BY full_date

The row-numbering trick can be generalised to more complex data (my measures span several dimensions that need aggregating up).
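The logic of the row-numbering query can be sketched outside SQL. The following Python is an illustration only, not the query itself: the function name `moving_average_last_n` and the sample data are hypothetical. It joins every calendar date to the measures on or before it, orders by date ascending and measurement date descending, numbers the rows within each date, and averages the first N.

```python
from collections import defaultdict
from datetime import date


def moving_average_last_n(calendar, measures, n):
    """Average the n most recent measures on or before each calendar date.

    calendar: iterable of dates; measures: list of (measured_on, value).
    Dates with no earlier measure are omitted, as in the inner join.
    """
    # Cross product filtered by measured_on <= day.
    joined = [
        (day, measured_on, value)
        for day in calendar
        for measured_on, value in measures
        if measured_on <= day
    ]
    # Two stable sorts: measurement date descending, then day ascending.
    joined.sort(key=lambda row: row[1], reverse=True)
    joined.sort(key=lambda row: row[0])

    kept = defaultdict(list)
    current_day, row_number = None, 0
    for day, _, value in joined:
        # Restart the counter when the day changes, like @num and @day.
        row_number = row_number + 1 if day == current_day else 1
        current_day = day
        if row_number <= n:
            kept[day].append(value)
    return {day: sum(vals) / len(vals) for day, vals in kept.items()}


calendar = [date(2020, 1, d) for d in range(1, 5)]
measures = [(date(2020, 1, 1), 10.0), (date(2020, 1, 2), 20.0), (date(2020, 1, 4), 60.0)]
print(moving_average_last_n(calendar, measures, 2))
```

With N = 2, 3 January has no measure of its own but still averages the two most recent earlier values, which is the behaviour the date-window approach cannot give.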

If your timeline is continuous (one value each day), you could improve your first attempt like this:

SELECT c.date,
       ( SELECT AVG(m.value) 
         FROM   measures as m
         WHERE  m.measured_on_dt 
                    -- BETWEEN is inclusive at both ends: 4 days back covers 5 calendar days
                    BETWEEN DATE_SUB(c.date, INTERVAL 4 day) AND c.date
       ) as `average_to_date`
FROM    calendar c
WHERE   c.date between date1 AND date2  -- graph boundaries
ORDER BY c.date ASC

If your timeline has holes in it, the average would be computed from fewer than 5 values.
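A small sketch of that caveat, in Python rather than SQL and with hypothetical names and data: `window_average` averages whatever measures fall inside a fixed window of calendar days ending on the given date, and also reports how many values it found.

```python
from datetime import date, timedelta


def window_average(day, measures, days):
    """Average the measures within the `days` calendar days ending on `day`.

    Returns (average, count); average is None when the window is empty.
    """
    start = day - timedelta(days=days - 1)  # inclusive window start
    values = [value for measured_on, value in measures if start <= measured_on <= day]
    if not values:
        return None, 0
    return sum(values) / len(values), len(values)


# Hypothetical data with a gap between 2 and 6 January.
measures = [(date(2020, 1, 1), 10.0), (date(2020, 1, 2), 20.0), (date(2020, 1, 6), 60.0)]
print(window_average(date(2020, 1, 6), measures, 5))  # only 2 values fall in the window
```

The count in the second element makes the gap visible: a 5-day window ending on 6 January sees only the values from 2 and 6 January.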
