How to COUNT(DISTINCT) with a window function with a frame in SQL Server

Question

Piggybacking on this lovely question: Partition Function COUNT() OVER possible using DISTINCT

I wish to calculate a moving count of distinct values. Something along the lines of:

Count(distinct machine_id) over(partition by model order by _timestamp rows between 6 preceding and current row)

Obviously, SQL Server does not support this syntax. Unfortunately, I don't understand well enough (haven't internalized would be more accurate) how the dense_rank workaround works:

dense_rank() over (partition by model order by machine_id) 
+ dense_rank() over (partition by model order by machine_id desc) 
- 1

and therefore I am not able to tweak it to meet my need for a moving window. If I order by machine_id, would it be enough to order by _timestamp as well and use rows between?

Answer 1

dense_rank() gives the dense ranking of the current record. When you run it with ASC sort order, you get the current record's dense rank (its rank among the unique values) counted from the first element. When you run it with DESC order, you get the current record's dense rank counted from the last record. Then you subtract 1 because the current record's own value is counted in both rankings. This gives the total number of unique values in the whole partition (repeated for every row).
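The arithmetic above can be checked on a tiny data set. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for SQL Server so that it can be run locally (this assumes the bundled SQLite is 3.25 or newer, which is when window functions were added). The table t, its sample rows, and the function name are made up for illustration; the DENSE_RANK expression itself uses the same syntax in SQL Server.

```python
import sqlite3


def distinct_count_via_dense_rank(rows):
    """Return (model, machine_id, distinct_machines) for every input row.

    rows: iterable of (model, machine_id) tuples (hypothetical sample data).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (model TEXT, machine_id TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    query = """
        SELECT model,
               machine_id,
               -- rank from the smallest value + rank from the largest value,
               -- minus 1 because the row's own value is counted in both ranks
               DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id)
             + DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id DESC)
             - 1 AS distinct_machines
        FROM t
        ORDER BY model, machine_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("A", "m1"), ("A", "m2"), ("A", "m1"), ("A", "m3"),
              ("B", "m9"), ("B", "m9")]
    for row in distinct_count_via_dense_rank(sample):
        print(row)  # every row of model A shows 3, every row of model B shows 1
```

For model A the value m2 has rank 2 from either end, so 2 + 2 - 1 = 3; m1 gets 1 + 3 - 1 = 3 and m3 gets 3 + 1 - 1 = 3, which is why the same total appears on every row of the partition.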

Since dense_rank does not support frames, you can't use this solution directly. You need to generate the frame by other means. One way could be JOINing the same table to itself with proper unique id comparisons. Then you can use dense_rank on the combined version.

Please check out the following solution proposal. The assumption is that you have a unique record key (record_id) available in your table. If you don't have a unique key, add another CTE before the first CTE and generate a unique key for each record (using the NEWID() function, or by combining multiple columns using concat() with a delimiter in between to account for NULLs). Note that the BETWEEN join below only picks up the intended rows if record_id increases with _timestamp within each model; a key generated with ROW_NUMBER() ordered by _timestamp would satisfy that, while a random NEWID() value would not.

; WITH cte AS (
SELECT 
  record_id
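  -- record_id of the row 6 positions earlier in the same model (NULL if fewer than 6 earlier rows exist)
  , record_id_6_record_earlier = LAG(record_id, 6, NULL) OVER (PARTITION BY model ORDER BY _timestamp)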
  , .... other columns
FROM mainTable
)
, cte2 AS (
SELECT 
  c.*
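  -- rank by machine_id (not by _timestamp) so that ascending rank + descending rank - 1 counts distinct machines
  , DistinctCntWithin6PriorRec = dense_rank() OVER (PARTITION BY c.model, c.record_id ORDER BY t.machine_id)
    + dense_rank() OVER (PARTITION BY c.model, c.record_id ORDER BY t.machine_id DESC)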
    - 1
  , RN = ROW_NUMBER() OVER (PARTITION BY c.record_id ORDER BY t._timestamp )
FROM cte c
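     -- restrict the joined rows to the same model; assumes record_id increases with _timestamp within a model
     LEFT JOIN mainTable t ON t.model = c.model
                          AND t.record_id BETWEEN c.record_id_6_record_earlier AND c.record_id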
)
SELECT *
FROM cte2
WHERE RN = 1

There are 2 limitations of this solution:

1. If a row has fewer than 6 earlier records in its partition, the LAG() function returns NULL and the BETWEEN join matches nothing, so this solution will not work for those rows. This can be handled in different ways. One quick way I can think of is to generate 6 LAG columns (1 record prior, 2 records prior, etc.) and then change the BETWEEN clause to something like this: BETWEEN COALESCE(c.record_id_6_record_earlier, c.record_id_5_record_earlier, ...., c.record_id_1_record_earlier, c.record_id) and c.record_id
2. COUNT() does not count NULL, but DENSE_RANK does. You need to account for that too if it applies to your data.
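The NULL caveat is easy to see on a small example. This sketch again uses Python's sqlite3 module purely as a runnable stand-in for SQL Server (SQLite 3.25 or newer is assumed); the table, the sample values, and the function name are hypothetical. It compares COUNT(DISTINCT) with the plain DENSE_RANK sum, and with one possible adjustment that subtracts 1 whenever the partition contains a NULL. The adjustment is a suggestion of this sketch, not something taken from the answer above.

```python
import sqlite3


def compare_null_handling(machine_ids):
    """Return (count_distinct, naive_dense_rank, adjusted_dense_rank).

    machine_ids: list of values for a single hypothetical model 'A';
    None is stored as SQL NULL.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (model TEXT, machine_id TEXT)")
    conn.executemany("INSERT INTO t VALUES ('A', ?)", [(m,) for m in machine_ids])

    # COUNT(DISTINCT ...) ignores NULL values.
    (count_distinct,) = conn.execute(
        "SELECT COUNT(DISTINCT machine_id) FROM t").fetchone()

    # DENSE_RANK treats NULL as one more distinct value, so the naive sum
    # is one too high when NULLs are present. Every row of the partition
    # carries the same totals, so reading a single row is enough.
    naive, adjusted = conn.execute("""
        SELECT DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id)
             + DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id DESC)
             - 1,
               DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id)
             + DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id DESC)
             - 1
             - MAX(CASE WHEN machine_id IS NULL THEN 1 ELSE 0 END)
                   OVER (PARTITION BY model)
        FROM t
        LIMIT 1
    """).fetchone()
    conn.close()
    return count_distinct, naive, adjusted


if __name__ == "__main__":
    print(compare_null_handling(["m1", None, "m2", "m1", None]))  # (2, 3, 2)
    print(compare_null_handling(["m1", "m2"]))                    # (2, 2, 2)
```

With two real machines and some NULLs, the naive expression reports 3 while COUNT(DISTINCT) reports 2; subtracting the NULL indicator brings the two back in line.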

Answer 2

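Just use outer apply (with top (6), each row sees itself plus the 5 rows before it; use top (7) if you need to match rows between 6 preceding and current row exactly):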

select t.*, t2.num_machines
from t outer apply
     (select count(distinct t2.machine_id) as num_machines
      from (select top (6) t2.*
            from t t2
            where t2.model = t.model and
                  t2.timestamp <= t.timestamp
            order by t2.timestamp desc
           ) t2
      ) t2;

If you have a lot of rows per model, you can also use a (cumbersome) trick using lag():

select t.*, v.num_machines
from (select t.*,
             lag(machine_id, 1) over (partition by model order by timestamp) as machine_id_1,
             lag(machine_id, 2) over (partition by model order by timestamp) as machine_id_2,
             lag(machine_id, 3) over (partition by model order by timestamp) as machine_id_3,
             lag(machine_id, 4) over (partition by model order by timestamp) as machine_id_4,
             lag(machine_id, 5) over (partition by model order by timestamp) as machine_id_5
      from t
     ) t cross apply
     (select count(distinct v.machine_id) as num_machines
      from (values (t.machine_id),
                   (t.machine_id_1),
                   (t.machine_id_2),
                   (t.machine_id_3),
                   (t.machine_id_4),
                   (t.machine_id_5)
           ) v(machine_id)
      ) v;

Under many circumstances, this might have the best performance in SQL Server.
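To sanity-check either query on a small sample, a brute-force reference can help. The sketch below is plain Python, not SQL, and is not part of the answer: the function name, the (model, timestamp, machine_id) tuple layout, and the frame_rows parameter are assumptions made for illustration. Like the queries above, it looks at the current row plus the preceding rows of the same model, and like COUNT(DISTINCT) it ignores NULL (None) values. It also assumes timestamps are unique within a model; otherwise the row order, and therefore the frame, is ambiguous in SQL as well.

```python
from collections import defaultdict


def moving_distinct_count(rows, frame_rows=6):
    """Brute-force moving count of distinct machine_id values per model.

    rows: iterable of (model, timestamp, machine_id) tuples.
    frame_rows: total rows in the frame, current row included.
                6 mirrors top (6) / lag 1..5; use 7 to mirror
                'rows between 6 preceding and current row'.
    Returns a list of (model, timestamp, machine_id, distinct_count),
    ordered by model and then timestamp.
    """
    by_model = defaultdict(list)
    for model, ts, machine_id in rows:
        by_model[model].append((ts, machine_id))

    out = []
    for model in sorted(by_model):
        ordered = sorted(by_model[model], key=lambda r: r[0])
        for i, (ts, machine_id) in enumerate(ordered):
            # Current row plus up to frame_rows - 1 earlier rows.
            window = ordered[max(0, i - frame_rows + 1): i + 1]
            # Skip None, matching how COUNT(DISTINCT ...) skips NULL.
            distinct = {m for _, m in window if m is not None}
            out.append((model, ts, machine_id, len(distinct)))
    return out


if __name__ == "__main__":
    sample = [("A", t, m) for t, m in enumerate(
        ["m1", "m2", "m1", "m3", "m1", "m1", "m1", "m1"], start=1)]
    for row in moving_distinct_count(sample, frame_rows=3):
        print(row)
```

Running the same sample through the SQL query and through this function, with matching frame sizes, should give the same counts row for row.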
