
How to do a COUNT(DISTINCT) using window functions with a frame in SQL Server

Piggybacking on this question: Partition Function COUNT() OVER possible using DISTINCT

I wish to calculate a moving count of distinct values. Something along the lines of:

Count(distinct machine_id) over(partition by model order by _timestamp rows between 6 preceding and current row)

Obviously, SQL Server does not support this syntax. Unfortunately, I don't understand well enough (haven't internalized would be more accurate) how the dense_rank workaround works:

dense_rank() over (partition by model order by machine_id) 
+ dense_rank() over (partition by model order by machine_id desc) 
- 1

and therefore I am not able to tweak it to meet my need for a moving window. If I order by machine_id, would it be enough to also order by _timestamp and use rows between?

dense_rank() gives the dense rank of the current record. When you run it with ASC sort order, you get the current record's dense rank (its rank among the unique values) counted from the first element. When you run it with DESC order, you get the current record's dense rank counted from the last element. You then subtract 1 because the current record's value is counted in both ranks. The result is the total number of unique values in the whole partition (repeated on every row).
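The arithmetic is easier to see on a tiny data set. The sketch below is not T-SQL: it runs the same two dense_rank() expressions through Python's bundled sqlite3 module (assuming a SQLite build of 3.25 or later, which added window functions) against a made-up table t with invented values, and checks that ascending rank plus descending rank minus 1 equals the number of distinct machine_id values per model.

```python
import sqlite3

# Hypothetical sample data: (model, machine_id). Table name and values are invented.
rows = [
    ("A", 10), ("A", 20), ("A", 10), ("A", 30),
    ("B", 7), ("B", 7),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (model TEXT, machine_id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

query = """
SELECT model,
       machine_id,
       DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id)      AS rank_asc,
       DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id DESC) AS rank_desc
FROM t
ORDER BY model, machine_id
"""

result = []
for model, machine_id, rank_asc, rank_desc in conn.execute(query):
    # The current value is counted in both ranks, hence the "- 1".
    distinct_count = rank_asc + rank_desc - 1
    result.append((model, machine_id, distinct_count))
    print(model, machine_id, rank_asc, rank_desc, distinct_count)
```

Every row of model A yields 3 (values 10, 20, 30) and every row of model B yields 1, whichever row the ranks are read from.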

Since dense_rank does not support frames, you can't use this solution directly. You need to generate the frame by other means. One way is to JOIN the table to itself with suitable unique id comparisons. Then you can use dense_rank on the joined result.

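Please check out the following solution proposal. The assumption is that you have a unique record key (record_id) available in your table. If you don't have a unique key, add another CTE before the first CTE and generate a unique key for each record (using the NEWID() function OR combining multiple columns using concat() with a delimiter in between to account for NULLs). Note that the BETWEEN comparison below only selects the right rows if the key increases with _timestamp within each model, so a key generated with ROW_NUMBER() ordered by model and _timestamp is the safer choice.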

; WITH cte AS (
SELECT 
  record_id
  , model
  -- LAG (not LEAD) looks backwards: the record_id of the row 6 positions earlier
  , record_id_6_record_earlier = LAG(record_id, 6, NULL) OVER (PARTITION BY model ORDER BY _timestamp)
  -- , .... other columns
FROM mainTable
)
, cte2 AS (
SELECT 
  c.*
  -- rank by machine_id (ascending + descending - 1) to count the distinct machines in the frame
  , DistinctCntWithin6PriorRec = dense_rank() OVER (PARTITION BY c.model, c.record_id ORDER BY t.machine_id)
    + dense_rank() OVER (PARTITION BY c.model, c.record_id ORDER BY t.machine_id DESC)
    - 1
  , RN = ROW_NUMBER() OVER (PARTITION BY c.record_id ORDER BY t._timestamp )
FROM cte c
     -- assumes record_id increases with _timestamp within each model
     LEFT JOIN mainTable t ON t.model = c.model
          AND t.record_id BETWEEN c.record_id_6_record_earlier AND c.record_id
)
SELECT *
FROM cte2
WHERE RN = 1

There are 2 LIMITATIONS of this solution:

  1. If the frame has fewer than 6 preceding records, then the LAG() function will return NULL and this solution will not work. This can be handled in different ways. One quick way I can think of is to generate 6 LAG columns (1 record prior, 2 records prior, etc.) and then change the BETWEEN clause to something like this: BETWEEN COALESCE(c.record_id_6_record_earlier, c.record_id_5_record_earlier, ...., c.record_id_1_record_earlier, c.record_id) and c.record_id

  2. COUNT() does not count NULL, but DENSE_RANK does. You need to account for that too if it applies to your data.
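The second limitation can be reproduced on three rows. As a rough sketch (again through Python's sqlite3 rather than SQL Server, with an invented table t), the block compares COUNT(DISTINCT) with the dense_rank expression when one machine_id is NULL, and then applies one common adjustment: subtracting 1 when the partition contains a NULL. The adjustment is an assumption about how you might handle this, not something taken from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (model TEXT, machine_id INTEGER)")
# Hypothetical data: one of the three machine_id values is NULL.
conn.executemany("INSERT INTO t VALUES (?, ?)", [("A", 10), ("A", None), ("A", 20)])

# COUNT(DISTINCT ...) skips the NULL.
count_distinct = conn.execute(
    "SELECT COUNT(DISTINCT machine_id) FROM t"
).fetchone()[0]

# The dense_rank expression treats NULL as one more distinct value.
rank_based = conn.execute("""
    SELECT DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id)
         + DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id DESC)
         - 1
    FROM t LIMIT 1
""").fetchone()[0]

# Possible adjustment: subtract 1 if the partition holds at least one NULL.
adjusted = conn.execute("""
    SELECT DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id)
         + DENSE_RANK() OVER (PARTITION BY model ORDER BY machine_id DESC)
         - 1
         - MAX(CASE WHEN machine_id IS NULL THEN 1 ELSE 0 END)
               OVER (PARTITION BY model)
    FROM t LIMIT 1
""").fetchone()[0]

print(count_distinct, rank_based, adjusted)  # prints: 2 3 2
```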

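Just use outer apply. Note that top (6) covers the current row plus the 5 rows before it; use top (7) to match rows between 6 preceding and current row exactly: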

select t.*, t2.num_machines
from t outer apply
     (select count(distinct t2.machine_id) as num_machines
      from (select top (6) t2.*
            from t t2
            where t2.model = t.model and
                  t2.timestamp <= t.timestamp
            order by t2.timestamp desc
           ) t2
      ) t2;

If you have a lot of rows per model, you can also use a (cumbersome) trick using lag():

select t.*, v.num_machines
from (select t.*,
             lag(machine_id, 1) over (partition by model order by timestamp) as machine_id_1,
             lag(machine_id, 2) over (partition by model order by timestamp) as machine_id_2,
             lag(machine_id, 3) over (partition by model order by timestamp) as machine_id_3,
             lag(machine_id, 4) over (partition by model order by timestamp) as machine_id_4,
             lag(machine_id, 5) over (partition by model order by timestamp) as machine_id_5
      from t
     ) t cross apply
     (select count(distinct v.machine_id) as num_machines
      from (values (t.machine_id),
                   (t.machine_id_1),
                   (t.machine_id_2),
                   (t.machine_id_3),
                   (t.machine_id_4),
                   (t.machine_id_5)
           ) v(machine_id)
      ) v;

Under many circumstances, this might have the best performance in SQL Server.
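To sanity-check either query on a small sample, a brute-force reference can help. The following is a minimal sketch in Python, not part of either answer; the function name moving_distinct_count, the tuple layout, and the sample rows are all hypothetical. It counts distinct non-NULL machine_id values over the current row and the window - 1 rows before it within each model, which with window=6 mirrors top (6) and the five lag() columns.

```python
from collections import defaultdict


def moving_distinct_count(rows, window=6):
    """Brute-force moving distinct count.

    rows: iterable of (model, timestamp, machine_id) tuples (hypothetical layout).
    Returns (model, timestamp, machine_id, num_machines) tuples, where the frame
    is the current row plus the window - 1 preceding rows of the same model.
    """
    by_model = defaultdict(list)
    for model, ts, machine_id in rows:
        by_model[model].append((ts, machine_id))

    out = []
    for model, items in by_model.items():
        items.sort(key=lambda item: item[0])  # order by timestamp
        for i, (ts, machine_id) in enumerate(items):
            frame = items[max(0, i - window + 1): i + 1]
            # Like COUNT(DISTINCT ...), ignore NULL (None) values.
            distinct = {m for _, m in frame if m is not None}
            out.append((model, ts, machine_id, len(distinct)))
    return out


# Made-up sample with a window of 3 rows to keep the output short.
sample = [
    ("A", 1, 10), ("A", 2, 20), ("A", 3, 10), ("A", 4, 30),
    ("B", 1, 7), ("B", 2, 7),
]
result = moving_distinct_count(sample, window=3)
for row in result:
    print(row)
```

For the sample above the counts come out as 1, 2, 2, 3 for model A and 1, 1 for model B. Because the frame is rebuilt for every row, this is only meant for cross-checking small inputs, not as a replacement for the SQL.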


 