Sum of last_value of each partition in SQL with window functions

I have a table that stores the total disk used at any point in time for each entity. I want to find the peak disk used in a time period. For example, the data looks something like the table below.

Note: the timestamps are actual timestamps with second precision; I have written them as 10am etc. for brevity.

timestamp | entity_id | disk_used
---------------------------------
    9am   |         1 |  10
   10am   |         2 |  20
   11am   |         2 |  15
   12pm   |         1 |  12
     

In this example, the peak disk used is 30 (10 from entity 1 and 20 from entity 2, reached at 10am).

I have tried a number of approaches.

  1. Sum of (max of each entity) doesn't work because it would give 20 + 12 = 32. Entity 2 reduced its size before entity 1 increased its size, so the peak disk usage was 30.
  2. I tried to use a window function to find the sum of the last_value of each entity:
select timestamp, entity_id,
    disk_used, 
    sum(last_value(disk_used) over(
        partition by entity_id order by timestamp)
    ) sum_of_last

hoping to generate the following, so that I can then take the max of sum_of_last:

timestamp | entity_id | disk_used | sum_of_last
-----------------------------------------------
    9am   |         1 |  10       |   10
   10am   |         2 |  20       |   30
   11am   |         2 |  15       |   25       // (10 + 15)
   12pm   |         1 |  12       |   27       // (12 + 15)
     

However, that query doesn't work because we cannot aggregate over a window function in ISO Standard SQL 2003. I am using Amazon Timestream, whose query engine is compatible with ISO Standard SQL 2003.

Rephrasing the same question: at each timestamp where we have a data point, I want the total disk used at that instant. To find it, sum the last value of each entity.

Is there an effective way to compute this?
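For reference, the intended computation can be sketched procedurally. This is only an illustration of the expected result, not a Timestream query: the function name `running_totals` is made up for this sketch, and integer hours stand in for the real timestamps.

```python
# Sample rows from the question as (hour, entity_id, disk_used).
# Integer hours are an assumption standing in for real timestamps.
rows = [(9, 1, 10), (10, 2, 20), (11, 2, 15), (12, 1, 12)]


def running_totals(rows):
    """Return [(time, total)], where total sums the latest disk_used
    of every entity seen so far (one output entry per input row)."""
    latest = {}   # entity_id -> most recent disk_used
    totals = []
    for ts, entity_id, disk_used in sorted(rows):
        latest[entity_id] = disk_used
        totals.append((ts, sum(latest.values())))
    return totals


totals = running_totals(rows)
peak = max(total for _, total in totals)

print(totals)  # [(9, 10), (10, 30), (11, 25), (12, 27)]
print(peak)    # 30
```

The totals match the sum_of_last column in the desired output, and the peak is the 30 reached at 10am.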

If you have only two entities, you can do:

select t.*,
       (last_value(case when entity_id = 1 then disk_used end ignore nulls) over (order by time) +
        last_value(case when entity_id = 2 then disk_used end ignore nulls) over (order by time)
       ) as total        
from t;

One way to generalize this to all entities is to generate a row for each entity at each time, impute the missing values, and aggregate:

-- order by ti.time (not t.time): t.time is null on the rows added by the left join
select ti.time, e.entity_id,
       last_value(disk_used ignore nulls) over (partition by e.entity_id order by ti.time) as imputed_disk_used
from (select distinct time from t) ti cross join
     (select distinct entity_id from t) e left join
     t
     on ti.time = t.time and e.entity_id = t.entity_id;

Then you can aggregate:

select time, sum(imputed_disk_used)
from (select ti.time, e.entity_id,
             -- order by ti.time (not t.time): t.time is null on the rows added by the left join
             last_value(disk_used ignore nulls) over (partition by e.entity_id order by ti.time) as imputed_disk_used
      from (select distinct time from t) ti cross join
           (select distinct entity_id from t) e left join
           t
           on ti.time = t.time and e.entity_id = t.entity_id
     ) te
group by time;

However, this gives the value per time rather than per time and entity_id.
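The grid-and-impute idea can be tried locally with Python's built-in sqlite3 module. This is a rough sketch under stated assumptions rather than the answer's exact query: SQLite does not support ignore nulls, so a correlated subquery is swapped in to carry the last known value forward, and integer hours stand in for the real timestamps. Whether Timestream accepts the same correlated subquery has not been checked here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (time integer, entity_id integer, disk_used integer)")
conn.executemany(
    "insert into t (time, entity_id, disk_used) values (?, ?, ?)",
    [(9, 1, 10), (10, 2, 20), (11, 2, 15), (12, 1, 12)],  # sample data, hours as integers
)

# Build every (time, entity) pair, impute each pair with the entity's most
# recent value at or before that time, then sum per time.
# The correlated subquery replaces last_value(... ignore nulls).
totals_sql = """
select time, sum(imputed_disk_used) as total_disk_used
from (select ti.time as time, e.entity_id as entity_id,
             (select t.disk_used
              from t
              where t.entity_id = e.entity_id and t.time <= ti.time
              order by t.time desc
              limit 1) as imputed_disk_used
      from (select distinct time from t) ti cross join
           (select distinct entity_id from t) e
     ) te
group by time
order by time
"""

totals = conn.execute(totals_sql).fetchall()
# Second level of aggregation: the peak over all per-time totals.
peak = conn.execute(f"select max(total_disk_used) from ({totals_sql})").fetchone()[0]

print(totals)  # [(9, 10), (10, 30), (11, 25), (12, 27)]
print(peak)    # 30
```

An entity with no row yet (entity 2 at 9am) imputes to null, which sum() skips, so the 9am total is 10.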

"I want to find the peak disk used in a time period"

You can use two levels of aggregation:

select max(sum_disk_used)
from (
    select time, sum(disk_used) as sum_disk_used
    from mytable
    group by time
) t

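The subquery computes the total disk_used at each point in time, then the outer query returns only the peak value. Note that this sums only the rows recorded at exactly the same time, so it matches the total usage only if every entity has a row at each timestamp.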

If your database supports some kind of limit clause, this can be simplified:

select time, sum(disk_used) as sum_disk_used
from mytable
group by time
order by sum_disk_used desc limit 1

To filter on a given period, you would typically add a where clause to the subquery.
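A minimal sketch of the period filter, run against SQLite through Python's sqlite3 module. The data is a hypothetical densified version of the sample in which every entity has a row at every time (entity 2 is assumed to be 0 at 9am), because summing per time only gives the true total under that condition. The helper name `peak_disk_used` and the integer hours are assumptions of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (time integer, entity_id integer, disk_used integer)")
conn.executemany(
    "insert into mytable (time, entity_id, disk_used) values (?, ?, ?)",
    [
        # Hypothetical dense data: one row per entity at every time.
        (9, 1, 10), (9, 2, 0),
        (10, 1, 10), (10, 2, 20),
        (11, 1, 10), (11, 2, 15),
        (12, 1, 12), (12, 2, 15),
    ],
)

# Two levels of aggregation, with the period filter inside the subquery.
peak_sql = """
select max(sum_disk_used)
from (
    select time, sum(disk_used) as sum_disk_used
    from mytable
    where time between ? and ?
    group by time
) t
"""


def peak_disk_used(start, end):
    """Peak of the per-time totals within [start, end]."""
    return conn.execute(peak_sql, (start, end)).fetchone()[0]


# Limit-based variant: sort descending so the first row is the peak.
top_sql = """
select time, sum(disk_used) as sum_disk_used
from mytable
where time between ? and ?
group by time
order by sum_disk_used desc
limit 1
"""

whole_period = peak_disk_used(9, 12)
late_period = peak_disk_used(11, 12)
peak_row = conn.execute(top_sql, (9, 12)).fetchone()

print(whole_period)  # 30
print(late_period)   # 27
print(peak_row)      # (10, 30)
```

The limit-based variant also returns the time at which the peak occurred, which the max() form does not.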
