How to do a COUNT(DISTINCT) using window functions with a frame in SQL Server

Piggybacking on this lovely question: Partition Function COUNT() OVER possible using DISTINCT

I wish to calculate a moving count of distinct values. Something along the lines of:

Count(distinct machine_id) over(partition by model order by _timestamp rows between 6 preceding and current row)

Obviously, SQL Server does not support this syntax. Unfortunately, I don't understand well enough (didn't internalize would be more accurate) how that dense_rank workaround works:

dense_rank() over (partition by model order by machine_id) 
+ dense_rank() over (partition by model order by machine_id desc) 
- 1

and therefore I am not able to tweak it to meet my need for a moving window. If I order by machine_id, would it be enough to order by _timestamp as well and use rows between?

dense_rank() gives the dense ranking of the current record. When you run it with ASC sort order, you get the current record's dense rank (unique-value rank) counted from the first element. When you run it with DESC order, you get the current record's dense rank counted from the last record. Then you subtract 1 because the current record's value is counted twice. This gives the total number of unique values in the whole partition (repeated for every row).
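To make the arithmetic concrete, here is a minimal sketch on made-up data. The CTE name demo and its rows are assumptions for illustration only; the column names follow the question.

```sql
-- Hypothetical sample data: model 'A' has two distinct machines, model 'B' has one.
WITH demo(model, machine_id) AS (
    SELECT model, machine_id
    FROM (VALUES ('A', 10), ('A', 20), ('A', 10), ('B', 30)) AS v(model, machine_id)
)
SELECT model,
       machine_id,
       -- rank of this machine_id counted from the smallest value in the partition
       DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id)      AS rank_asc,
       -- rank of this machine_id counted from the largest value in the partition
       DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id DESC) AS rank_desc,
       -- the current value is included in both ranks, hence the - 1
       DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id)
     + DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id DESC)
     - 1                                                               AS distinct_machines
FROM demo;
```

For model 'A', machine 10 gets rank_asc 1 and rank_desc 2, machine 20 gets 2 and 1, so every 'A' row ends up with distinct_machines = 2; the single 'B' row gets 1 + 1 - 1 = 1.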

Since dense_rank does not support frames, you can't use this solution directly. You need to generate the frame by other means. One way is to JOIN the table to itself with proper unique id comparisons. Then you can use dense_rank on the joined result.

Please check out the following proposed solution. It assumes you have a unique record key (record_id) available in your table. If you don't have a unique key, add another CTE before the first CTE and generate a unique key for each record (using the NEWID() function, or by combining multiple columns using concat() with a delimiter in between to account for NULLs). Note that the join below picks the frame with BETWEEN on record_id, so the key also has to sort in _timestamp order within each model.

; WITH cte AS (
SELECT 
  record_id
  , record_id_6_record_earlier = LAG(record_id, 6, NULL) OVER (PARTITION BY model ORDER BY _timestamp) -- key of the row 6 positions earlier, NULL if there is none
  , .... other columns
FROM mainTable
)
, cte2 AS (
SELECT 
  c.*
  , DistinctCntWithin6PriorRec = dense_rank() OVER (PARTITION BY c.model, c.record_id ORDER BY t.machine_id) -- rank over machine_id, the value being counted
    + dense_rank() OVER (PARTITION BY c.model, c.record_id ORDER BY t.machine_id DESC)
    - 1
  , RN = ROW_NUMBER() OVER (PARTITION BY c.record_id ORDER BY t._timestamp )
FROM cte c
     LEFT JOIN mainTable t ON t.model = c.model -- keep the frame inside the current model
          AND t.record_id BETWEEN c.record_id_6_record_earlier AND c.record_id
)
SELECT *
FROM cte2
WHERE RN = 1
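For readers who want something they can run as-is, here is a minimal self-contained sketch of the same self-join idea. It is a variation, not the exact proposal above: instead of an existing record_id it generates a per-model ROW_NUMBER() key (named rn here), and it frames on rn - 2 (a 3-row window) to keep the sample small. The sample rows and the names keyed, framed, and pick are assumptions for illustration.

```sql
-- Hypothetical sample data standing in for mainTable.
WITH mainTable(model, machine_id, _timestamp) AS (
    SELECT model, machine_id, ts
    FROM (VALUES ('A', 1, 1), ('A', 1, 2), ('A', 2, 3), ('A', 3, 4), ('A', 1, 5)
         ) AS v(model, machine_id, ts)
),
keyed AS (
    -- Generated key: consecutive within each model, in _timestamp order.
    SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY model ORDER BY _timestamp)
    FROM mainTable
),
framed AS (
    SELECT c.model, c._timestamp, c.machine_id,
           -- dense_rank trick applied to the rows joined in for each current row
           distinct_cnt = DENSE_RANK() OVER (PARTITION BY c.model, c.rn ORDER BY t.machine_id)
                        + DENSE_RANK() OVER (PARTITION BY c.model, c.rn ORDER BY t.machine_id DESC)
                        - 1,
           -- keep a single output row per current row
           pick = ROW_NUMBER() OVER (PARTITION BY c.model, c.rn ORDER BY t.rn)
    FROM keyed AS c
    JOIN keyed AS t
      ON t.model = c.model
     AND t.rn BETWEEN c.rn - 2 AND c.rn   -- current row and the 2 rows before it
)
SELECT model, _timestamp, machine_id, distinct_cnt
FROM framed
WHERE pick = 1
ORDER BY model, _timestamp;
```

Because the frame boundary is computed as rn - 2 rather than looked up with LAG(), rows near the start of a partition simply join to fewer rows instead of hitting a NULL boundary. For a "6 preceding and current row" frame the boundary would be c.rn - 6.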

There are 2 LIMITATIONS of this solution:

  1. If the current row has fewer than 6 earlier records in its partition, the LAG() function returns NULL and this solution will not work. This can be handled in different ways: one quick way I can think of is to generate 6 LAG columns (1 record prior, 2 records prior, etc.) and then change the BETWEEN clause to something like BETWEEN COALESCE(c.record_id_6_record_earlier, c.record_id_5_record_earlier, ...., c.record_id_1_record_earlier, c.record_id) and c.record_id

  2. COUNT() does not count NULL, but DENSE_RANK does. You need to account for that too if it applies to your data.
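The second limitation is easy to see on a tiny made-up set of values. The sketch below compares COUNT(DISTINCT) with the dense_rank trick on three values, one of them NULL, and shows one possible adjustment (subtracting 1 when the set contains a NULL). The names demo and ranked and the adjustment itself are assumptions for illustration, not part of the proposal above.

```sql
-- Hypothetical values: two real machine ids and one NULL.
WITH demo(machine_id) AS (
    SELECT machine_id
    FROM (VALUES (1), (2), (NULL)) AS v(machine_id)
),
ranked AS (
    SELECT machine_id,
           -- plain trick: NULL is ranked like any other value
           DENSE_RANK() OVER (ORDER BY machine_id)
         + DENSE_RANK() OVER (ORDER BY machine_id DESC)
         - 1 AS via_dense_rank,
           -- adjusted: take one off again if any NULL is present in the set
           DENSE_RANK() OVER (ORDER BY machine_id)
         + DENSE_RANK() OVER (ORDER BY machine_id DESC)
         - 1
         - MAX(CASE WHEN machine_id IS NULL THEN 1 ELSE 0 END) OVER () AS via_dense_rank_adjusted
    FROM demo
)
SELECT (SELECT COUNT(DISTINCT machine_id) FROM demo) AS via_count_distinct,
       MAX(via_dense_rank)                           AS via_dense_rank,
       MAX(via_dense_rank_adjusted)                  AS via_dense_rank_adjusted
FROM ranked;
```

Here COUNT(DISTINCT) ignores the NULL and returns 2, the plain trick returns 3, and the adjusted expression returns 2 again. In a partitioned query the MAX(...) OVER () would need the same PARTITION BY as the dense_rank calls.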

Just use outer apply:

select t.*, t2.num_machines
from t outer apply
     (select count(distinct t2.machine_id) as num_machines
      from (select top (6) t2.*
            from t t2
            where t2.model = t.model and
                  t2.timestamp <= t.timestamp
            order by t2.timestamp desc
           ) t2
      ) t2;
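A runnable sketch of the same outer apply pattern on made-up rows may help when adapting it. The sample data, the variable @frame_rows, and the aliases w and x are assumptions for illustration; the frame is shortened to 3 rows so the effect is visible on five rows.

```sql
DECLARE @frame_rows int = 3;  -- current row plus the 2 rows before it

-- Hypothetical sample data standing in for table t.
WITH t(model, machine_id, [timestamp]) AS (
    SELECT model, machine_id, ts
    FROM (VALUES ('A', 1, 1), ('A', 1, 2), ('A', 2, 3), ('A', 3, 4), ('A', 1, 5)
         ) AS v(model, machine_id, ts)
)
SELECT t.model, t.machine_id, t.[timestamp], x.num_machines
FROM t
OUTER APPLY (
    -- count the distinct machines among the most recent @frame_rows rows
    SELECT COUNT(DISTINCT w.machine_id) AS num_machines
    FROM (SELECT TOP (@frame_rows) t2.machine_id
          FROM t AS t2
          WHERE t2.model = t.model
            AND t2.[timestamp] <= t.[timestamp]
          ORDER BY t2.[timestamp] DESC
         ) AS w
) AS x
ORDER BY t.model, t.[timestamp];
```

For these rows num_machines comes out as 1, 1, 2, 3, 3. Note that TOP (n) counts the current row too, so top (6) covers the current row and the 5 before it; a frame of "6 preceding and current row" spans 7 rows and would correspond to top (7). If timestamps can tie within a model, the rows TOP picks are not deterministic unless a tiebreaker column is added to the ORDER BY.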

If you have a lot of rows per model, you can also use a (cumbersome) trick using lag():

select t.*, v.num_machines
from (select t.*,
             lag(machine_id, 1) over (partition by model order by timestamp) as machine_id_1,
             lag(machine_id, 2) over (partition by model order by timestamp) as machine_id_2,
             lag(machine_id, 3) over (partition by model order by timestamp) as machine_id_3,
             lag(machine_id, 4) over (partition by model order by timestamp) as machine_id_4,
             lag(machine_id, 5) over (partition by model order by timestamp) as machine_id_5
      from t
     ) t cross apply
     (select count(distinct v.machine_id) as num_machines
      from (values (t.machine_id),
                   (t.machine_id_1),
                   (t.machine_id_2),
                   (t.machine_id_3),
                   (t.machine_id_4),
                   (t.machine_id_5)
           ) v(machine_id)
      ) v;

Under many circumstances, this might have the best performance in SQL Server.
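The lag() variant can be tried on made-up rows as well. The sketch below uses a 3-row frame (two lag() columns) to stay short; the sample data and the names lagged, l, and x are assumptions for illustration.

```sql
-- Hypothetical sample data standing in for table t.
WITH t(model, machine_id, [timestamp]) AS (
    SELECT model, machine_id, ts
    FROM (VALUES ('A', 1, 1), ('A', 1, 2), ('A', 2, 3), ('A', 3, 4), ('A', 1, 5)
         ) AS v(model, machine_id, ts)
),
lagged AS (
    -- pull the previous two machine ids onto the current row
    SELECT t.*,
           LAG(machine_id, 1) OVER (PARTITION BY model ORDER BY [timestamp]) AS machine_id_1,
           LAG(machine_id, 2) OVER (PARTITION BY model ORDER BY [timestamp]) AS machine_id_2
    FROM t
)
SELECT l.model, l.machine_id, l.[timestamp], v.num_machines
FROM lagged AS l
CROSS APPLY (
    -- unpivot the current and lagged values, then count the distinct ones
    SELECT COUNT(DISTINCT x.machine_id) AS num_machines
    FROM (VALUES (l.machine_id), (l.machine_id_1), (l.machine_id_2)) AS x(machine_id)
) AS v
ORDER BY l.model, l.[timestamp];
```

Near the start of a partition lag() returns NULL, and COUNT(DISTINCT) ignores NULLs, so short frames need no special handling here. For these rows num_machines comes out as 1, 1, 2, 3, 3.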
