How to do a COUNT(DISTINCT) using window functions with a frame in SQL Server
Piggybacking on this lovely question: Partition Function COUNT() OVER possible using DISTINCT
I wish to calculate a moving count of distinct values. Something along the lines of:
Count(distinct machine_id) over(partition by model order by _timestamp rows between 6 preceding and current row)
Obviously, SQL Server does not support this syntax. Unfortunately, I don't understand well enough (didn't internalize would be more accurate) how that dense_rank workaround works:
dense_rank() over (partition by model order by machine_id)
+ dense_rank() over (partition by model order by machine_id desc)
- 1
and therefore I am not able to tweak it to meet my need for a moving window. If I order by machine_id, would it be enough to order by _timestamp as well and use rows between?
dense_rank() gives the dense ranking of the current record. When you run it with ASC sort order, you get the current record's dense rank (unique value rank) counted from the first element. When you run it with DESC order, you get the current record's dense rank counted from the last record. Then you subtract 1 because the current record's value is counted twice. This gives the total number of unique values in the whole partition (repeated for every row).
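To make the arithmetic concrete, here is a minimal sketch on a hypothetical table variable (@readings and its values are made up for illustration, not taken from the question). With machine_id values 10, 10, 20, 30 for model A, the ascending and descending dense ranks always sum to one more than the number of distinct values, so distinct_machines should come out as 3 on every model A row and 1 on every model B row.

```sql
-- Hypothetical sample data; table name and values are illustrative only.
DECLARE @readings TABLE (model varchar(10), machine_id int);

INSERT INTO @readings (model, machine_id)
VALUES ('A', 10), ('A', 10), ('A', 20), ('A', 30),
       ('B', 40), ('B', 40);

SELECT
      model
    , machine_id
    -- position of this value among the distinct values, counted from the smallest
    , rank_asc  = DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id)
    -- position of this value among the distinct values, counted from the largest
    , rank_desc = DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id DESC)
    -- the current value is included in both ranks, hence the - 1
    , distinct_machines =
          DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id)
        + DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id DESC)
        - 1
FROM @readings;
```

For example, the row with machine_id 20 has rank_asc = 2 and rank_desc = 2, giving 2 + 2 - 1 = 3.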
Since dense_rank does not support frames, you can't use this solution directly. You need to generate the frame by other means. One way is to JOIN the table to itself with proper unique id comparisons. Then you can use dense_rank on the combined version.
Please check out the following solution proposal. The assumption is that you have a unique record key (record_id) available in your table, and the BETWEEN comparison in the join only makes sense if that key increases with _timestamp within each model. If you don't have a unique key, add another CTE before the first CTE and generate a unique key for each record (using the NEWID() function, or by combining multiple columns using concat() with a delimiter in between to account for NULLs).
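;WITH cte AS (
    SELECT
          record_id
        , model
        , machine_id
        , _timestamp
        -- record_id of the row 6 positions earlier in the same model (NULL if fewer than 6 prior rows)
        , record_id_6_record_earlier = LAG(record_id, 6, NULL) OVER (PARTITION BY model ORDER BY _timestamp)
        -- , .... other columns
    FROM mainTable
)
, cte2 AS (
    SELECT
          c.*
        -- distinct machine_id values among the joined frame rows (ascending rank + descending rank - 1)
        , DistinctCntWithin6PriorRec = dense_rank() OVER (PARTITION BY c.model, c.record_id ORDER BY t.machine_id)
                                     + dense_rank() OVER (PARTITION BY c.model, c.record_id ORDER BY t.machine_id DESC)
                                     - 1
        -- the join repeats each c row once per frame row; RN = 1 keeps a single copy
        , RN = ROW_NUMBER() OVER (PARTITION BY c.record_id ORDER BY t._timestamp)
    FROM cte c
    LEFT JOIN mainTable t
        ON  t.model = c.model
        AND t.record_id BETWEEN c.record_id_6_record_earlier AND c.record_id
)
SELECT *
FROM cte2
WHERE RN = 1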
There are 2 limitations of this solution:
1. If a row has fewer than 6 prior records, the LAG() function returns NULL and thus this solution will not work. This can be handled in different ways. One quick way I can think of is to generate 6 LAG columns (1 record prior, 2 records prior, etc.) and then change the BETWEEN clause to something like this: BETWEEN COALESCE(c.record_id_6_record_earlier, c.record_id_5_record_earlier, ...., c.record_id_1_record_earlier, c.record_id) AND c.record_id
2. COUNT() does not count NULL, but DENSE_RANK does. You need to account for that too if it applies to your data.
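Both limitations can also be sidestepped in a single query. The following is only a sketch under stated assumptions, not the approach proposed above: it swaps the LAG lookup for a ROW_NUMBER()-based position (rn), so short frames at the start of a partition need no COALESCE chain, and it subtracts 1 whenever the frame contains a NULL machine_id so that the result matches COUNT(DISTINCT) semantics. The table variable @mainTable, its sample rows, and the names rn, pick and DistinctCnt are hypothetical.

```sql
-- Hypothetical sample data; names and values are illustrative only.
DECLARE @mainTable TABLE (model varchar(10), _timestamp datetime2(0), machine_id int NULL);

INSERT INTO @mainTable (model, _timestamp, machine_id)
VALUES ('A', '2024-01-01T00:00:00', 1),
       ('A', '2024-01-02T00:00:00', 2),
       ('A', '2024-01-03T00:00:00', NULL),
       ('A', '2024-01-04T00:00:00', 2),
       ('B', '2024-01-01T00:00:00', 7);

;WITH numbered AS (
    SELECT
          model, _timestamp, machine_id
        -- position of each row within its model; stands in for an ordered unique key
        , rn = ROW_NUMBER() OVER (PARTITION BY model ORDER BY _timestamp)
    FROM @mainTable
)
, framed AS (
    SELECT
          c.model, c._timestamp, c.machine_id
        , DistinctCnt =
              DENSE_RANK() OVER (PARTITION BY c.model, c.rn ORDER BY t.machine_id)
            + DENSE_RANK() OVER (PARTITION BY c.model, c.rn ORDER BY t.machine_id DESC)
            - 1
            -- DENSE_RANK ranks NULL as a value of its own; remove it if the frame has one
            - MAX(CASE WHEN t.machine_id IS NULL THEN 1 ELSE 0 END)
                  OVER (PARTITION BY c.model, c.rn)
        -- the join repeats each c row once per frame row; keep one copy
        , pick = ROW_NUMBER() OVER (PARTITION BY c.model, c.rn ORDER BY t.rn)
    FROM numbered c
    JOIN numbered t
        ON  t.model = c.model
        AND t.rn BETWEEN c.rn - 6 AND c.rn   -- 6 preceding rows and the current row
)
SELECT model, _timestamp, machine_id, DistinctCnt
FROM framed
WHERE pick = 1
ORDER BY model, _timestamp;
```

Because every row joins at least to itself, an inner join is sufficient here, and a frame consisting only of NULLs yields 0, as COUNT(DISTINCT) would.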
Just use outer apply:
select t.*, m.num_machines
from t outer apply
     (select count(distinct w.machine_id) as num_machines
      from (-- the current row plus the 5 rows before it for the same model;
            -- use top (7) to match "6 preceding and current row"
            select top (6) t2.*
            from t t2
            where t2.model = t.model and
                  t2.timestamp <= t.timestamp
            order by t2.timestamp desc
           ) w
     ) m;
If you have a lot of rows per model, you can also use a (cumbersome) trick using lag():
select t.*, v.num_machines
from (select t.*,
lag(machine_id, 1) over (partition by model order by timestamp) as machine_id_1,
lag(machine_id, 2) over (partition by model order by timestamp) as machine_id_2,
lag(machine_id, 3) over (partition by model order by timestamp) as machine_id_3,
lag(machine_id, 4) over (partition by model order by timestamp) as machine_id_4,
lag(machine_id, 5) over (partition by model order by timestamp) as machine_id_5
from t
) t cross apply
(select count(distinct v.machine_id) as num_machines
from (values (t.machine_id),
(t.machine_id_1),
(t.machine_id_2),
(t.machine_id_3),
(t.machine_id_4),
(t.machine_id_5)
) v(machine_id)
) v;
Under many circumstances, this might have the best performance in SQL Server.