
How to aggregate (counting distinct items) over a sliding window in SQL Server?

I am currently using this query (in SQL Server) to count the number of unique items each day:

SELECT Date, COUNT(DISTINCT item) 
FROM myTable 
GROUP BY Date 
ORDER BY Date

How can I transform this to get, for each date, the number of unique items over the past 3 days (including the current day)?

The output should be a table with 2 columns: the first column holds every date in the original table, and the second column holds the number of unique items for that date's window.

For instance, if the original table is:

Date        Item  
01/01/2018  A  
01/01/2018  B  
02/01/2018  C  
03/01/2018  C    
04/01/2018  C

With my query above I currently get the unique count for each day:

Date        count  
01/01/2018  2  
02/01/2018  1  
03/01/2018  1  
04/01/2018  1

and I am looking to get the unique count over a 3-day rolling window as the result:

Date        count  
01/01/2018  2  
02/01/2018  3  (because items ABC on 1st and 2nd Jan)
03/01/2018  3  (because items ABC on 1st,2nd,3rd Jan)    
04/01/2018  1  (because only item C on 2nd,3rd,4th Jan)    

Using an apply provides a convenient way to form sliding windows. Sample data:

CREATE TABLE myTable 
    ([DateCol] datetime, [Item] varchar(1))
;

INSERT INTO myTable 
    ([DateCol], [Item])
VALUES
    ('2018-01-01 00:00:00', 'A'),
    ('2018-01-01 00:00:00', 'B'),
    ('2018-01-02 00:00:00', 'C'),
    ('2018-01-03 00:00:00', 'C'),
    ('2018-01-04 00:00:00', 'C')
;

CREATE NONCLUSTERED INDEX IX_DateCol  
    ON myTable([DateCol])  
;    

Query :

select distinct 
       t1.dateCol
     , oa.ItemCount
from myTable t1
outer apply (
      select count(distinct t2.item) as ItemCount
      from myTable t2
      where t2.DateCol between dateadd(day,-2,t1.DateCol) and t1.DateCol
  ) oa
order by t1.dateCol ASC

Results :

|              dateCol | ItemCount |
|----------------------|-----------|
| 2018-01-01T00:00:00Z |         2 |
| 2018-01-02T00:00:00Z |         3 |
| 2018-01-03T00:00:00Z |         3 |
| 2018-01-04T00:00:00Z |         1 |

There may be some performance gains from reducing the date column to distinct values before using the apply, like so:

select 
       d.DateCol
     , oa.ItemCount
from (
    select distinct t1.DateCol
    from myTable t1
     ) d
outer apply (
      select count(distinct t2.item) as ItemCount
      from myTable t2
      where t2.DateCol between dateadd(day,-2,d.DateCol) and d.DateCol
  ) oa
order by d.DateCol ASC
;

You could use group by in that subquery instead of select distinct, but the execution plan will remain the same.

Demo at SQL Fiddle
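
If DateCol carries a time-of-day component (the sample data does not), BETWEEN compares full datetime values and the window edges shift with the time. A minimal sketch under that assumption, using a half-open range on the date part; the DateOnly alias is introduced here for illustration only:

```sql
-- Sketch: same apply pattern, but windowing on calendar days.
-- Assumes myTable(DateCol datetime, Item) as created above.
select
       d.DateOnly
     , oa.ItemCount
from (
    select distinct cast(t1.DateCol as date) as DateOnly
    from myTable t1
     ) d
outer apply (
      select count(distinct t2.Item) as ItemCount
      from myTable t2
      where t2.DateCol >= dateadd(day, -2, d.DateOnly)  -- start of the day two days back
        and t2.DateCol <  dateadd(day,  1, d.DateOnly)  -- before the start of the next day
  ) oa
order by d.DateOnly asc
;
```

Comparing DateCol directly against the bounds, rather than wrapping it in a function, leaves the index on DateCol usable for the range.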

The most straightforward solution is to join the table with itself based on dates:

SELECT t1.DateCol, COUNT(DISTINCT t2.Item) AS C
FROM testdata AS t1 
LEFT JOIN testdata AS t2 ON t2.DateCol BETWEEN DATEADD(dd, -2, t1.DateCol) AND t1.DateCol
GROUP BY t1.DateCol
ORDER BY t1.DateCol

Output:

| DateCol                 | C |
|-------------------------|---|
| 2018-01-01 00:00:00.000 | 2 |
| 2018-01-02 00:00:00.000 | 3 |
| 2018-01-03 00:00:00.000 | 3 |
| 2018-01-04 00:00:00.000 | 1 |

GROUP BY should be faster than DISTINCT (make sure you have an index on your Date column):

DECLARE @tbl TABLE([Date] DATE, [Item] VARCHAR(100))
;

INSERT INTO @tbl  VALUES
    ('2018-01-01 00:00:00', 'A'),
    ('2018-01-01 00:00:00', 'B'),
    ('2018-01-02 00:00:00', 'C'),
    ('2018-01-03 00:00:00', 'C'),
    ('2018-01-04 00:00:00', 'C');

SELECT t.[Date]

      --Just for control. You can take this part away
      ,(SELECT DISTINCT t2.[Item] AS [*]
        FROM @tbl AS t2
        WHERE t2.[Date]<=t.[Date] 
          AND t2.[Date]>=DATEADD(DAY,-2,t.[Date]) FOR XML PATH('')) AS CountedItems

      --This sub-select comes back with your counts 
      ,(SELECT COUNT(DISTINCT t2.[Item])
        FROM @tbl AS t2
        WHERE t2.[Date]<=t.[Date] 
          AND t2.[Date]>=DATEADD(DAY,-2,t.[Date])) AS ItemCount
FROM @tbl AS t
GROUP BY t.[Date];

The result

Date        CountedItems    ItemCount
2018-01-01  AB              2
2018-01-02  ABC             3
2018-01-03  ABC             3
2018-01-04  C               1

This solution is different from other solutions. Can you check performance of this query on real data with comparison to other answers?

The basic idea is that each row can participate in the window for its own date, the day after, or the day after that. So this first expands the row out into three rows with those different dates attached and then it can just use a regular COUNT(DISTINCT) aggregating on the computed date. The HAVING clause is just to avoid returning results for dates that were solely computed and not present in the base data.

with cte(Date, Item) as (
    -- yyyymmdd literals are parsed the same way regardless of language/DATEFORMAT settings
    select cast(a as datetime), b 
    from (values 
        ('20180101','A')
        ,('20180101','B')
        ,('20180102','C')
        ,('20180103','C')
        ,('20180104','C')) t(a,b)
)

select 
    [Date] = dateadd(dd, n, Date), [Count] = count(distinct Item)
from 
    cte
    cross join (values (0),(1),(2)) t(n)
group by dateadd(dd, n, Date)
having max(iif(n = 0, 1, 0)) = 1

option (force order)

Output:

|        Date             | Count |
|-------------------------|-------|
| 2018-01-01 00:00:00.000 |   2   |
| 2018-01-02 00:00:00.000 |   3   |
| 2018-01-03 00:00:00.000 |   3   |
| 2018-01-04 00:00:00.000 |   1   |

It might be faster if you have many duplicate rows:

select 
    [Date] = dateadd(dd, n, Date), [Count] = count(distinct Item)
from 
    (select distinct Date, Item from cte) c
    cross join (values (0),(1),(2)) t(n)
group by dateadd(dd, n, Date)
having max(iif(n = 0, 1, 0)) = 1

option (force order)
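
To make the expansion step described above easier to see, here is a small sketch that lists the intermediate rows before they are aggregated. The column names SourceDate and WindowDate are made up for this illustration, and only three of the sample rows are used to keep it short:

```sql
-- Sketch: each source row is copied to its own date and the two following dates.
with cte([Date], Item) as (
    select cast(a as date), b
    from (values
         ('20180101','A')
        ,('20180101','B')
        ,('20180102','C')) t(a,b)
)
select
    SourceDate = [Date],                   -- date the row actually has
    WindowDate = dateadd(dd, n, [Date]),   -- date whose window the row counts toward
    Item
from cte
cross join (values (0),(1),(2)) t(n)
order by WindowDate, Item;
```

Three source rows times three offsets gives nine rows. Grouping them by WindowDate and counting distinct Item yields the rolling count, and the HAVING clause then drops any WindowDate that never appears as a SourceDate.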

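Use the GETDATE() function to get the current date and DATEADD() to go back 3 days. Note that this restricts the per-day counts to the last 3 days relative to today; it does not compute a rolling 3-day count for each date: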

 SELECT Date, count(DISTINCT item) 
 FROM myTable 
 WHERE [Date] >= DATEADD(day,-3, GETDATE())
 GROUP BY Date 
 ORDER BY Date

SQL

SELECT DISTINCT Date,
       (SELECT COUNT(DISTINCT item)
        FROM myTable t2
        WHERE t2.Date BETWEEN DATEADD(day, -2, t1.Date) AND t1.Date) AS count
FROM myTable t1
ORDER BY Date;

Demo

Rextester demo: http://rextester.com/ZRDQ22190

Since COUNT(DISTINCT item) OVER (PARTITION BY [Date]) is not supported you can use dense_rank to emulate that:

SELECT Date, dense_rank() over (partition by [Date] order by [item]) 
+ dense_rank() over (partition by [Date] order by [item] desc) 
- 1 as count_distinct_item
FROM myTable 

One thing to note is that dense_rank will count NULL as a distinct value, whereas COUNT will not.

Refer to this post for more details.
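
If item can be NULL and the result should match COUNT(DISTINCT item), one possible adjustment is to subtract 1 whenever the partition contains a NULL. This is a sketch rather than part of the original answer; the select distinct is added here only to collapse the output to one row per date:

```sql
-- Sketch: emulate COUNT(DISTINCT item) per date, ignoring NULL items.
select distinct
    [Date],
    dense_rank() over (partition by [Date] order by item)
  + dense_rank() over (partition by [Date] order by item desc)
  - 1
    -- remove the extra value contributed by NULL, if the date has any NULL item
  - max(case when item is null then 1 else 0 end) over (partition by [Date])
    as count_distinct_item
from myTable;
```

For a date whose items are all NULL this evaluates to 0, which is what COUNT(DISTINCT item) would return.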

Here is a simple solution that uses myTable itself as the source of grouping dates (edited for SQLServer dateadd). Note that this query assumes there will be at least one record in myTable for every date; if any date is absent, it will not appear in the query results, even if there are records for the 2 days prior:

select
    date,
    (select
        count(distinct item)
        from (select distinct date, item from myTable) as d2
     where
        d2.date between dateadd(day,-2,d.date) and d.date
    ) as count
from (select distinct date from myTable) as d
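
If dates that are missing from myTable should still appear in the output, the grouping dates can come from a generated calendar instead of the table itself. The following is a sketch under that assumption; the CTE names bounds and calendar are invented for this example, and the calendar simply spans the first to the last date found in the table:

```sql
-- Sketch: one output row per calendar day, even for days with no rows in myTable.
with bounds as (
    select min(cast([Date] as date)) as FirstDate,
           max(cast([Date] as date)) as LastDate
    from myTable
),
calendar as (
    -- recursive CTE: start at the first date and add one day until the last date
    select FirstDate as [Date], LastDate from bounds
    union all
    select dateadd(day, 1, [Date]), LastDate
    from calendar
    where [Date] < LastDate
)
select
    c.[Date],
    (select count(distinct t.Item)
     from myTable as t
     where t.[Date] >= dateadd(day, -2, c.[Date])
       and t.[Date] <  dateadd(day,  1, c.[Date])) as ItemCount
from calendar as c
order by c.[Date]
option (maxrecursion 0);  -- lift the default limit of 100 recursion levels
```

For a long date range, a permanent calendar or numbers table would probably be a better source of dates than a recursive CTE.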

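I solved this question with math.

Write any day number z as z = 3x + y, where x = z / 3 (integer division) and y = z % 3. I need the range from 3 * (x - 1) + y + 1 to 3 * (x - 1) + y + 3, that is, from z - 2 to z.

Substituting x and y, the lower bound is 3 * (x - 1) + y + 1 = 3 * (z / 3 - 1) + z % 3 + 1

In that case, I can use group by (between 3 * (z / 3 - 1) + z % 3 + 1 and z):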

    SELECT  iif(OrderDate between  3 * (cast(OrderDate as int) / 3 - 1) + (cast(OrderDate as int) % 3) + 1 
and orderdate, Orderdate, 0)
, count(sh.SalesOrderID) FROM Sales.SalesOrderDetail shd
JOIN Sales.SalesOrderHeader sh on sh.SalesOrderID = shd.SalesOrderID
group by iif(OrderDate between  3 * (cast(OrderDate as int) / 3 - 1) + (cast(OrderDate as int) % 3) + 1 
and orderdate, Orderdate, 0)
order by iif(OrderDate between  3 * (cast(OrderDate as int) / 3 - 1) + (cast(OrderDate as int) % 3) + 1 
and orderdate, Orderdate, 0)

If you need a window of a different length, you can use:

declare @n int = 4; -- another day count

SELECT  iif(OrderDate between  @n * (cast(OrderDate as int) / @n - 1) + (cast(OrderDate as int) % @n) + 1 
and orderdate, Orderdate, 0)
, count(sh.SalesOrderID) FROM Sales.SalesOrderDetail shd
JOIN Sales.SalesOrderHeader sh on sh.SalesOrderID = shd.SalesOrderID
group by iif(OrderDate between  @n * (cast(OrderDate as int) / @n - 1) + (cast(OrderDate as int) % @n) + 1 
and orderdate, Orderdate, 0)
order by iif(OrderDate between  @n * (cast(OrderDate as int) / @n - 1) + (cast(OrderDate as int) % @n) + 1 
and orderdate, Orderdate, 0)
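
The lower-bound formula above can be checked on a few sample day numbers. This sketch uses plain integers for z rather than dates cast to int, and the expected column is added only for comparison:

```sql
-- Sketch: the lower bound n * (z / n - 1) + z % n + 1 simplifies to z - n + 1.
declare @n int = 3;  -- window length in days

select
    z,
    @n * (z / @n - 1) + (z % @n) + 1 as lower_bound,  -- integer division, as in the answer
    z - @n + 1                        as expected
from (values (10), (11), (12)) as v(z);
-- lower_bound is 8, 9, 10 for z = 10, 11, 12
```

With @n = 3 the window for day z therefore runs from z - 2 to z, matching the 3-day window in the question.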
