计数与前 24 小时的天差

Question

I have a table to consolidate stats for different types from my other tables:我有一个表来合并其他表中不同类型的统计信息：

Table name: my_stats表名： my_stats

lob         category            parameter                   total_count     timestamp                       day_difference
DSS         Industry            Advertising & Marketing     310057          2020-04-21 07:35:14.237987
DSS         Function            Administration              357351          2020-04-21 11:06:27.009658
DSS         Country             czechia                     321             2020-04-21 11:12:55.731648
DSS         Records per domain  apple.com                   65              2020-04-21 11:13:17.855059
DSS         Records per domain  Records per domain          5               2020-04-21 11:13:17.85510

DSS         Industry            Advertising & Marketing     310059          2020-04-21 10:36:14.237987
DSS         Function            Administration              357353          2020-04-21 14:08:26.009658
DSS         Country             czechia                     324             2020-04-21 14:11:55.731648
DSS         Records per domain  apple.com                   60              2020-04-21 14:08:17.855059
DSS         Records per domain  Records per domain          5               2020-04-21 14:14:17.85510

DSS         Industry            Advertising & Marketing     310058          2020-04-22 08:35:14.237987
DSS         Function            Administration              357312          2020-04-22 11:05:27.009658
DSS         Country             czechia                     201             2020-04-22 11:13:55.731648
DSS         Records per domain  apple.com                   55              2020-04-22 11:14:17.855059
DSS         Records per domain  Records per domain          2               2020-04-22 11:15:17.85510

my_stats gets updated every 3 hours. my_stats每 3 小时更新一次。 So, new entries are added every 3 hours.因此，每 3 小时添加一次新条目。 I have to find the day_difference value.我必须找到day_difference值。

The day_difference value is (count - count of subsequent row closest 24 hours prior). day_difference值为（count - 24 小时前最近的后续行的计数）。

The output of the table should be:表格的output应该是：

lob         category            parameter                   total_count     timestamp                       day_difference
DSS         Industry            Advertising & Marketing     310057          2020-04-21 07:35:14.237987      NULL
DSS         Function            Administration              357351          2020-04-21 11:06:27.009658      NULL
DSS         Country             czechia                     321             2020-04-21 11:12:55.731648      NULL
DSS         Records per domain  apple.com                   65              2020-04-21 11:13:17.855059      NULL
DSS         Records per domain  Records per domain          5               2020-04-21 11:13:17.85510       NULL

DSS         Industry            Advertising & Marketing     310059          2020-04-21 10:36:14.237987      NULL
DSS         Function            Administration              357353          2020-04-21 14:08:26.009658      NULL
DSS         Country             czechia                     324             2020-04-21 14:11:55.731648      NULL
DSS         Records per domain  apple.com                   60              2020-04-21 14:08:17.855059      NULL
DSS         Records per domain  Records per domain          5               2020-04-21 14:14:17.85510       NULL

DSS         Industry            Advertising & Marketing     310058          2020-04-22 08:35:14.237987      1
DSS         Function            Administration              357312          2020-04-22 11:05:27.009658      NULL
DSS         Country             czechia                     201             2020-04-22 11:13:55.731648      -120
DSS         Records per domain  apple.com                   55              2020-04-22 11:14:17.855059      -10
DSS         Records per domain  Records per domain          2               2020-04-22 11:15:17.85510       -3

If for the row, subsequent row prior to 24 hours is not available, then keep the day_difference = NULL.如果对于该行，24 小时之前的后续行不可用，则保留 day_difference = NULL。

Another corner case to be considered is the difference should be with the CLOSEST prior 24 hours difference.另一个要考虑的极端情况是差异应该是最近的 24 小时差异。

Is there a way I can get this result in SQL?有没有办法在 SQL 中得到这个结果？

Answer 1

This would have been a good spot to use a window function such as lag() with a range specification.这是使用 window function 的好地方，例如具有范围规范的lag() 。 Alas, Redshift only supports rows in the frame clause to window functions. las，Redshift 仅支持 frame 子句中的rows到 window 函数。

Here is an alternative that uses a correlated subquery:这是使用相关子查询的替代方法：

select
    s.*,
    total_count - (
        select total_count
        from my_stats s1
        where 
            s1.lob = s.lob
            and s1.category = s.category
            and s1.parameter = s.parameter
            and s1.timestamp < s.timestamp - interval '1 day'
        order by s1.timestamp desc
        limit 1
    ) day_diff
from my_stats s

For performance, you do want an index on (lob, category, parameter, timestamp, total_count) .为了提高性能，您确实需要(lob, category, parameter, timestamp, total_count)上的索引。

Answer 2

You can use RANK to identify the first record in the 24-hour period, then just a simple subtraction.您可以使用RANK 来识别24 小时内的第一条记录，然后进行简单的减法。 I'm assuming you're using Amazon RedShift SQL here.我假设您在这里使用的是 Amazon RedShift SQL。 If you are not, the NVL syntax is replace with ISNULL in MS SQL.如果不是，则NVL语法在 MS SQL 中被替换为ISNULL 。

https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RANK.html https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RANK.html

SELECT
  r.lob
 ,r.category
 ,r.parameter
 ,r.total_count
 ,r.timestamp
 ,i.day_difference
FROM dbo.my_stats r
LEFT JOIN (

  SELECT
    x.lob
   ,x.category
   ,x.parameter
   ,x.total_count
   ,x.timestamp
   ,RANK() OVER (PARTITION BY p.lob,p.category,p.parameter ORDER BY p.timestamp) AS rank_order
   ,x.total_count - p.total_count AS day_difference
  FROM dbo.my_stats x
  INNER JOIN dbo.my_stats p  --> lookback period
    ON p.lob       = x.lob
   AND p.category  = x.category
   AND p.parameter = x.parameter
   AND p.timestamp < x.timestamp
   AND p.timestamp > DATEADD(HOUR,-24,x.timestamp)

) i
 ON i.lob       = r.lob
AND i.category  = r.category
AND i.parameter = r.parameter
AND i.timestamp = r.timestamp
WHERE NVL(i.rank_order,1)=1

计数与前 24 小时的天差

问题描述

2 个解决方案

解决方案1
3 2020-04-26 19:01:24

解决方案2
1 2020-05-03 02:11:17

计数与前 24 小时的天差

问题描述

2 个解决方案

解决方案1 3 2020-04-26 19:01:24

解决方案2 1 2020-05-03 02:11:17

解决方案1
3 2020-04-26 19:01:24

解决方案2
1 2020-05-03 02:11:17