
How to aggregate (counting distinct items) over a sliding window in SQL Server?

I am currently using this query (in SQL Server) to count the number of unique items each day:

SELECT Date, COUNT(DISTINCT item) 
FROM myTable 
GROUP BY Date 
ORDER BY Date

How can I transform this to get, for each date, the number of unique items over the past 3 days (including the current day)?

The output should be a table with 2 columns: the first column holds all dates in the original table, and the second column holds the number of unique items for each date.

For instance, if the original table is:

Date        Item  
01/01/2018  A  
01/01/2018  B  
02/01/2018  C  
03/01/2018  C    
04/01/2018  C

With my query above I currently get the unique count for each day:

Date        count  
01/01/2018  2  
02/01/2018  1  
03/01/2018  1  
04/01/2018  1

and I am looking to get as a result the unique count over a 3-day rolling window:

Date        count  
01/01/2018  2  
02/01/2018  3  (because items ABC on 1st and 2nd Jan)
03/01/2018  3  (because items ABC on 1st,2nd,3rd Jan)    
04/01/2018  1  (because only item C on 2nd,3rd,4th Jan)    

Using an apply provides a convenient way to form sliding windows:

CREATE TABLE myTable 
    ([DateCol] datetime, [Item] varchar(1))
;

INSERT INTO myTable 
    ([DateCol], [Item])
VALUES
    ('2018-01-01 00:00:00', 'A'),
    ('2018-01-01 00:00:00', 'B'),
    ('2018-01-02 00:00:00', 'C'),
    ('2018-01-03 00:00:00', 'C'),
    ('2018-01-04 00:00:00', 'C')
;

CREATE NONCLUSTERED INDEX IX_DateCol  
    ON myTable([DateCol])  
;

Query:

select distinct 
       t1.dateCol
     , oa.ItemCount
from myTable t1
outer apply (
      select count(distinct t2.item) as ItemCount
      from myTable t2
      where t2.DateCol between dateadd(day,-2,t1.DateCol) and t1.DateCol
  ) oa
order by t1.dateCol ASC

Results:

|              dateCol | ItemCount |
|----------------------|-----------|
| 2018-01-01T00:00:00Z |         2 |
| 2018-01-02T00:00:00Z |         3 |
| 2018-01-03T00:00:00Z |         3 |
| 2018-01-04T00:00:00Z |         1 |
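
The window width does not have to be hard-coded. Below is a minimal sketch of the same apply query with the number of days held in a variable; the name @days is a hypothetical addition and not part of the original answer, and the query assumes the myTable definition above.

```sql
-- Hypothetical parameter: number of days in the window, including the current day.
DECLARE @days int = 3;

SELECT DISTINCT
       t1.DateCol
     , oa.ItemCount
FROM myTable AS t1
OUTER APPLY (
      -- Count distinct items from (@days - 1) days back up to the current row's date.
      SELECT COUNT(DISTINCT t2.Item) AS ItemCount
      FROM myTable AS t2
      WHERE t2.DateCol BETWEEN DATEADD(day, -(@days - 1), t1.DateCol) AND t1.DateCol
  ) AS oa
ORDER BY t1.DateCol ASC;
```

With @days = 3 this reduces to the query above, because -(@days - 1) evaluates to -2.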

There may be some performance gains from reducing the date column to its distinct values before using the apply, like so:

select 
       d.DateCol
     , oa.ItemCount
from (
    select distinct t1.DateCol
    from myTable t1
     ) d
outer apply (
      select count(distinct t2.item) as ItemCount
      from myTable t2
      where t2.DateCol between dateadd(day,-2,d.DateCol) and d.DateCol
  ) oa
order by d.DateCol ASC
;

Instead of using select distinct in that subquery you could use group by, but the execution plan will remain the same.
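
For reference, the group by variant could look like the sketch below. It is written against the myTable definition given earlier (column DateCol) and changes only the derived table that produces the distinct dates.

```sql
-- Same sliding-window query, with GROUP BY instead of SELECT DISTINCT in the derived table.
SELECT d.DateCol
     , oa.ItemCount
FROM (
    SELECT t1.DateCol
    FROM myTable AS t1
    GROUP BY t1.DateCol
) AS d
OUTER APPLY (
    SELECT COUNT(DISTINCT t2.Item) AS ItemCount
    FROM myTable AS t2
    WHERE t2.DateCol BETWEEN DATEADD(day, -2, d.DateCol) AND d.DateCol
) AS oa
ORDER BY d.DateCol ASC;
```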

Demo at SQL Fiddle

The most straightforward solution is to join the table with itself based on dates:

SELECT t1.DateCol, COUNT(DISTINCT t2.Item) AS C
FROM testdata AS t1 
LEFT JOIN testdata AS t2 ON t2.DateCol BETWEEN DATEADD(dd, -2, t1.DateCol) AND t1.DateCol
GROUP BY t1.DateCol
ORDER BY t1.DateCol

Output:

| DateCol                 | C |
|-------------------------|---|
| 2018-01-01 00:00:00.000 | 2 |
| 2018-01-02 00:00:00.000 | 3 |
| 2018-01-03 00:00:00.000 | 3 |
| 2018-01-04 00:00:00.000 | 1 |

GROUP BY should be faster than DISTINCT (make sure to have an index on your Date column):

DECLARE @tbl TABLE([Date] DATE, [Item] VARCHAR(100))
;

INSERT INTO @tbl  VALUES
    ('2018-01-01 00:00:00', 'A'),
    ('2018-01-01 00:00:00', 'B'),
    ('2018-01-02 00:00:00', 'C'),
    ('2018-01-03 00:00:00', 'C'),
    ('2018-01-04 00:00:00', 'C');

SELECT t.[Date]

      --Just for control. You can take this part away
      ,(SELECT DISTINCT t2.[Item] AS [*]
        FROM @tbl AS t2
        WHERE t2.[Date]<=t.[Date] 
          AND t2.[Date]>=DATEADD(DAY,-2,t.[Date]) FOR XML PATH('')) AS CountedItems

      --This sub-select comes back with your counts 
      ,(SELECT COUNT(DISTINCT t2.[Item])
        FROM @tbl AS t2
        WHERE t2.[Date]<=t.[Date] 
          AND t2.[Date]>=DATEADD(DAY,-2,t.[Date])) AS ItemCount
FROM @tbl AS t
GROUP BY t.[Date];

The result:

Date        CountedItems    ItemCount
2018-01-01  AB              2
2018-01-02  ABC             3
2018-01-03  ABC             3
2018-01-04  C               1

This solution is different from the other solutions. Can you check the performance of this query on real data in comparison with the other answers?

The basic idea is that each row can participate in the window for its own date, the day after, or the day after that. So this first expands each row into three rows with those different dates attached, and then it can just use a regular COUNT(DISTINCT) aggregating on the computed date. The HAVING clause is just to avoid returning results for dates that were solely computed and are not present in the base data.

with cte(Date, Item) as (
    select convert(datetime, a, 103), b -- style 103 = dd/mm/yyyy, so '02/01/2018' is 2 January regardless of session settings
    from (values 
        ('01/01/2018','A')
        ,('01/01/2018','B')
        ,('02/01/2018','C')
        ,('03/01/2018','C')
        ,('04/01/2018','C')) t(a,b)
)

select 
    [Date] = dateadd(dd, n, Date), [Count] = count(distinct Item)
from 
    cte
    cross join (values (0),(1),(2)) t(n)
group by dateadd(dd, n, Date)
having max(iif(n = 0, 1, 0)) = 1

option (force order)

Output:

|        Date             | Count |
|-------------------------|-------|
| 2018-01-01 00:00:00.000 |   2   |
| 2018-01-02 00:00:00.000 |   3   |
| 2018-01-03 00:00:00.000 |   3   |
| 2018-01-04 00:00:00.000 |   1   |
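
To make the expansion step visible, the sketch below lists the rows produced by the cross join before any aggregation. The column aliases SourceDate and WindowDate are illustrative names only, and the dates are parsed as dd/mm/yyyy (style 103), which is what the output above implies.

```sql
WITH cte([Date], Item) AS (
    SELECT CONVERT(datetime, a, 103), b
    FROM (VALUES
         ('01/01/2018', 'A')
        ,('01/01/2018', 'B')
        ,('02/01/2018', 'C')
        ,('03/01/2018', 'C')
        ,('04/01/2018', 'C')) AS t(a, b)
)
SELECT c.[Date]                   AS SourceDate  -- date the row actually carries
     , c.Item
     , t.n                                       -- offset: 0 = own day, 1 and 2 = following days
     , DATEADD(dd, t.n, c.[Date]) AS WindowDate  -- window this copy of the row contributes to
FROM cte AS c
CROSS JOIN (VALUES (0), (1), (2)) AS t(n)
ORDER BY WindowDate, c.Item, t.n;
```

Each of the 5 source rows appears 3 times, giving 15 rows. The copies with a WindowDate of 5 or 6 January exist only with n greater than 0, which is why the HAVING clause drops those two dates from the final result.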

It might be faster if you have many duplicate rows:

select 
    [Date] = dateadd(dd, n, Date), [Count] = count(distinct Item)
from 
    (select distinct Date, Item from cte) c
    cross join (values (0),(1),(2)) t(n)
group by dateadd(dd, n, Date)
having max(iif(n = 0, 1, 0)) = 1

option (force order)

Use the GETDATE() function to get the current date, and DATEADD() to get the last 3 days:

 SELECT Date, count(DISTINCT item) 
 FROM myTable 
 WHERE [Date] >= DATEADD(day,-3, GETDATE())
 GROUP BY Date 
 ORDER BY Date

SQL

SELECT DISTINCT Date,
       (SELECT COUNT(DISTINCT item)
        FROM myTable t2
        WHERE t2.Date BETWEEN DATEADD(day, -2, t1.Date) AND t1.Date) AS count
FROM myTable t1
ORDER BY Date;

Demo

Rextester demo: http://rextester.com/ZRDQ22190

Since COUNT(DISTINCT item) OVER (PARTITION BY [Date]) is not supported, you can use dense_rank to emulate it:

SELECT Date, dense_rank() over (partition by [Date] order by [item]) 
+ dense_rank() over (partition by [Date] order by [item] desc) 
- 1 as count_distinct_item
FROM myTable 

One thing to note is that dense_rank will count NULL as a distinct value, whereas COUNT will not.

Refer to this post for more details.
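
If the item column can contain NULL, one way to bring the dense_rank emulation in line with COUNT(DISTINCT) is to subtract 1 for every partition that contains a NULL. The sketch below is an assumption-laden illustration, not part of the original answer: the table variable @t and its sample rows are made up for the demonstration.

```sql
-- Hypothetical sample data with a NULL item on the first date.
DECLARE @t TABLE ([Date] date, item varchar(10));

INSERT INTO @t VALUES
    ('2018-01-01', 'A'),
    ('2018-01-01', 'B'),
    ('2018-01-01', NULL),
    ('2018-01-02', 'C');

SELECT [Date]
     , item
     , dense_rank() OVER (PARTITION BY [Date] ORDER BY item)
     + dense_rank() OVER (PARTITION BY [Date] ORDER BY item DESC)
     - 1
       -- remove the extra rank that a NULL item adds to its partition
     - MAX(CASE WHEN item IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY [Date])
       AS count_distinct_item
FROM @t;
```

Without the last term, every row for 2018-01-01 would report 3, because NULL takes a rank of its own; with it, the rows report 2, which matches what COUNT(DISTINCT item) gives for that date.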

Here is a simple solution that uses myTable itself as the source of grouping dates (edited for SQL Server dateadd). Note that this query assumes there will be at least one record in myTable for every date; if any date is absent, it will not appear in the query results, even if there are records for the 2 days prior:

select
    date,
    (select
        count(distinct item)
        from (select distinct date, item from myTable) as d2
     where
        d2.date between dateadd(day,-2,d.date) and d.date
    ) as count
from (select distinct date from myTable) as d
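
If dates with no records should also appear in the output, one option is to drive the query from a generated calendar instead of from myTable. The following is a rough sketch under stated assumptions: the CTE names bounds and calendar are illustrative, the date column is assumed to hold whole days with no time-of-day component, and the calendar simply spans the earliest to the latest date found in the table.

```sql
WITH bounds AS (
    -- First and last day present in the data.
    SELECT CAST(MIN([date]) AS date) AS d_min
         , CAST(MAX([date]) AS date) AS d_max
    FROM myTable
), calendar AS (
    -- Recursive CTE: one row per day from d_min to d_max, gaps included.
    SELECT d_min AS d, d_max FROM bounds
    UNION ALL
    SELECT DATEADD(day, 1, d), d_max FROM calendar WHERE d < d_max
)
SELECT c.d AS [date]
     , (SELECT COUNT(DISTINCT t.item)
        FROM myTable AS t
        WHERE t.[date] BETWEEN DATEADD(day, -2, c.d) AND c.d) AS [count]
FROM calendar AS c
ORDER BY c.d
OPTION (MAXRECURSION 0);  -- lift the default limit of 100 recursion levels
```

A permanent calendar table, if one is available, would serve the same purpose without the recursion.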

I solved this question with math.

z (any day) = 3x + y, where y is z mod 3. I need the range from 3 * (x - 1) + y + 1 to 3 * (x - 1) + y + 3.

3 * (x - 1) + y + 1 = 3 * (z / 3 - 1) + z % 3 + 1

In that case, I can group by (between 3 * (z / 3 - 1) + z % 3 + 1 and z):

    SELECT  iif(OrderDate between  3 * (cast(OrderDate as int) / 3 - 1) + (cast(OrderDate as int) % 3) + 1 
and orderdate, Orderdate, 0)
, count(sh.SalesOrderID) FROM Sales.SalesOrderDetail shd
JOIN Sales.SalesOrderHeader sh on sh.SalesOrderID = shd.SalesOrderID
group by iif(OrderDate between  3 * (cast(OrderDate as int) / 3 - 1) + (cast(OrderDate as int) % 3) + 1 
and orderdate, Orderdate, 0)
order by iif(OrderDate between  3 * (cast(OrderDate as int) / 3 - 1) + (cast(OrderDate as int) % 3) + 1 
and orderdate, Orderdate, 0)
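
The two bounds used in the query above can be simplified by substituting z = 3x + y back in. The short derivation below treats z / 3 as integer division (x) and z % 3 as the remainder (y), which is how the answer uses them:

```latex
\begin{aligned}
z &= 3x + y, \qquad x = \lfloor z/3 \rfloor,\quad y = z \bmod 3 \\
3(x-1) + y + 1 &= (3x + y) - 3 + 1 = z - 2 \\
3(x-1) + y + 3 &= (3x + y) - 3 + 3 = z
\end{aligned}
```

So the range runs from z - 2 to z, that is, the current day and the two days before it.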

If you need a different number of days per group, you can use:

declare @n int = 4; -- another day count

SELECT  iif(OrderDate between  @n * (cast(OrderDate as int) / @n - 1) + (cast(OrderDate as int) % @n) + 1 
and orderdate, Orderdate, 0)
, count(sh.SalesOrderID) FROM Sales.SalesOrderDetail shd
JOIN Sales.SalesOrderHeader sh on sh.SalesOrderID = shd.SalesOrderID
group by iif(OrderDate between  @n * (cast(OrderDate as int) / @n - 1) + (cast(OrderDate as int) % @n) + 1 
and orderdate, Orderdate, 0)
order by iif(OrderDate between  @n * (cast(OrderDate as int) / @n - 1) + (cast(OrderDate as int) % @n) + 1 
and orderdate, Orderdate, 0)
