[英]Day difference of counts from prior 24 hours
I have a table to consolidate stats for different types from my other tables:我有一个表来合并其他表中不同类型的统计信息:
Table name: my_stats
表名: my_stats
lob category parameter total_count timestamp day_difference
DSS Industry Advertising & Marketing 310057 2020-04-21 07:35:14.237987
DSS Function Administration 357351 2020-04-21 11:06:27.009658
DSS Country czechia 321 2020-04-21 11:12:55.731648
DSS Records per domain apple.com 65 2020-04-21 11:13:17.855059
DSS Records per domain Records per domain 5 2020-04-21 11:13:17.85510
DSS Industry Advertising & Marketing 310059 2020-04-21 10:36:14.237987
DSS Function Administration 357353 2020-04-21 14:08:26.009658
DSS Country czechia 324 2020-04-21 14:11:55.731648
DSS Records per domain apple.com 60 2020-04-21 14:08:17.855059
DSS Records per domain Records per domain 5 2020-04-21 14:14:17.85510
DSS Industry Advertising & Marketing 310058 2020-04-22 08:35:14.237987
DSS Function Administration 357312 2020-04-22 11:05:27.009658
DSS Country czechia 201 2020-04-22 11:13:55.731648
DSS Records per domain apple.com 55 2020-04-22 11:14:17.855059
DSS Records per domain Records per domain 2 2020-04-22 11:15:17.85510
my_stats
gets updated every 3 hours. my_stats
每 3 小时更新一次。 So, new entries are added every 3 hours.因此,每 3 小时添加一次新条目。 I have to find the day_difference
value.我必须找到day_difference
值。
The day_difference
value is (count - count of subsequent row closest 24 hours prior). day_difference
值为(count - 24 小时前最近的后续行的计数)。
The output of the table should be:表格的output应该是:
lob category parameter total_count timestamp day_difference
DSS Industry Advertising & Marketing 310057 2020-04-21 07:35:14.237987 NULL
DSS Function Administration 357351 2020-04-21 11:06:27.009658 NULL
DSS Country czechia 321 2020-04-21 11:12:55.731648 NULL
DSS Records per domain apple.com 65 2020-04-21 11:13:17.855059 NULL
DSS Records per domain Records per domain 5 2020-04-21 11:13:17.85510 NULL
DSS Industry Advertising & Marketing 310059 2020-04-21 10:36:14.237987 NULL
DSS Function Administration 357353 2020-04-21 14:08:26.009658 NULL
DSS Country czechia 324 2020-04-21 14:11:55.731648 NULL
DSS Records per domain apple.com 60 2020-04-21 14:08:17.855059 NULL
DSS Records per domain Records per domain 5 2020-04-21 14:14:17.85510 NULL
DSS Industry Advertising & Marketing 310058 2020-04-22 08:35:14.237987 1
DSS Function Administration 357312 2020-04-22 11:05:27.009658 NULL
DSS Country czechia 201 2020-04-22 11:13:55.731648 -120
DSS Records per domain apple.com 55 2020-04-22 11:14:17.855059 -10
DSS Records per domain Records per domain 2 2020-04-22 11:15:17.85510 -3
If for the row, subsequent row prior to 24 hours is not available, then keep the day_difference = NULL.如果对于该行,24 小时之前的后续行不可用,则保留 day_difference = NULL。
Another corner case to be considered is the difference should be with the CLOSEST prior 24 hours difference.另一个要考虑的极端情况是差异应该是最近的 24 小时差异。
Is there a way I can get this result in SQL?有没有办法在 SQL 中得到这个结果?
This would have been a good spot to use a window function such as lag()
with a range specification.这是使用 window function 的好地方,例如具有范围规范的lag()
。 Alas, Redshift only supports rows
in the frame clause to window functions. las,Redshift 仅支持 frame 子句中的rows
到 window 函数。
Here is an alternative that uses a correlated subquery:这是使用相关子查询的替代方法:
select
s.*,
total_count - (
select total_count
from my_stats s1
where
s1.lob = s.lob
and s1.category = s.category
and s1.parameter = s.parameter
and s1.timestamp < s.timestamp - interval '1 day'
order by s1.timestamp desc
limit 1
) day_diff
from my_stats s
For performance, you do want an index on (lob, category, parameter, timestamp, total_count)
.为了提高性能,您确实需要(lob, category, parameter, timestamp, total_count)
上的索引。
You can use RANK to identify the first record in the 24-hour period, then just a simple subtraction.您可以使用RANK 来识别24 小时内的第一条记录,然后进行简单的减法。 I'm assuming you're using Amazon RedShift SQL here.我假设您在这里使用的是 Amazon RedShift SQL。 If you are not, the NVL
syntax is replace with ISNULL
in MS SQL.如果不是,则NVL
语法在 MS SQL 中被替换为ISNULL
。
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RANK.html https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RANK.html
SELECT
r.lob
,r.category
,r.parameter
,r.total_count
,r.timestamp
,i.day_difference
FROM dbo.my_stats r
LEFT JOIN (
SELECT
x.lob
,x.category
,x.parameter
,x.total_count
,x.timestamp
,RANK() OVER (PARTITION BY p.lob,p.category,p.parameter ORDER BY p.timestamp) AS rank_order
,x.total_count - p.total_count AS day_difference
FROM dbo.my_stats x
INNER JOIN dbo.my_stats p --> lookback period
ON p.lob = x.lob
AND p.category = x.category
AND p.parameter = x.parameter
AND p.timestamp < x.timestamp
AND p.timestamp > DATEADD(HOUR,-24,x.timestamp)
) i
ON i.lob = r.lob
AND i.category = r.category
AND i.parameter = r.parameter
AND i.timestamp = r.timestamp
WHERE NVL(i.rank_order,1)=1
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.