Rolling COUNT DISTINCT of n-day active users using T-SQL

I am counting 7-day active users using T-SQL. I used the following code:

SELECT 
    *, 
    COUNT(DISTINCT [UserID]) OVER (
        PARTITION BY [HospitalID], [HospitalName], [Device]
        ORDER BY [Date]
        ROWS 7 PRECEDING
    ) AS [7-Day Active Users]
FROM UserActivity
ORDER BY [HospitalID], [HospitalName], [Device], [Date]

I was told: Use of DISTINCT is not allowed with the OVER clause.

UserActivity is a table with columns HospitalID, HospitalName, Device (either phone or tablet), Date and UserID (which can be NULL). To make things easier, I have filled the gaps between dates so that Date is consecutive and I can use ROWS 7 PRECEDING with confidence.

I searched online a lot and found that most solutions either use other SQL dialects (which is not possible in my case) or use the DENSE_RANK function, which does not support a moving window. What is the correct, and hopefully simple and concise, way of solving my problem?

Sample Data: https://docs.google.com/spreadsheets/d/19vrBK8ixpiPJycRjb1ekiKnEUYk5AaUH/edit?usp=sharing&ouid=110206477774349430845&rtpof=true&sd=true

Sorry to see that COUNT DISTINCT was not supported in that type of SQL... I hadn't known that. Especially after you went to the trouble of fixing the gaps between dates!

I used Rasgo to generate the SQL, so this won't work directly in your version (it was tested with Snowflake), but I think it will work as long as you fix the DATEADD function. Every RDBMS seems to do DATEADD differently.

The general concept here is to join the data upon itself using a range join condition in the WHERE clause.

Luckily, this should work for you without having to fix the gaps in the dates first.

WITH BASIC_OFFSET_7DAY AS (
  SELECT 
    A.HOSPITALNAME, 
    A.HOSPITALID, 
    A.DEVICE, 
    A.DATE, 
    COUNT(DISTINCT B.USERID) as COUNT_DISTINCT_USERID_PAST7DAY, 
    COUNT(1) AS AGG_ROW_COUNT 
  FROM 
    UserActivity A 
    INNER JOIN UserActivity B ON A.HOSPITALNAME = B.HOSPITALNAME 
    AND A.HOSPITALID = B.HOSPITALID 
    AND A.DEVICE = B.DEVICE 
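  WHERE 
    -- Range condition: keep B rows dated from 7 days before A.DATE through A.DATE
    -- (both ends inclusive, so 8 calendar dates; use -6 for exactly 7)
    B.DATE >= DATEADD(day, -7, A.DATE) 
    AND B.DATE <= A.DATE 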
  GROUP BY 
    A.HOSPITALNAME, 
    A.HOSPITALID, 
    A.DEVICE, 
    A.DATE
) 
SELECT 
  src.*, 
  BASIC_OFFSET_7DAY.COUNT_DISTINCT_USERID_PAST7DAY 
FROM 
  UserActivity src 
  LEFT OUTER JOIN BASIC_OFFSET_7DAY ON BASIC_OFFSET_7DAY.DATE = src.DATE 
  AND BASIC_OFFSET_7DAY.HOSPITALNAME = src.HOSPITALNAME 
  AND BASIC_OFFSET_7DAY.HOSPITALID = src.HOSPITALID 
  AND BASIC_OFFSET_7DAY.DEVICE = src.DEVICE

Let me know how that works out and if it doesn't work I'll help you out.
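For SQL Server specifically, the same range self-join can be tried on a few inline rows. This is a minimal sketch rather than a tested port: the table variable @UserActivity, its column types, and the sample rows are all made up for illustration. T-SQL's DATEADD takes the same (datepart, number, date) argument order used in the query above. The offset here is -6 so that the window spans exactly seven calendar dates including the current one; the -7 offset in the query above spans eight.

```sql
-- Hypothetical sample data; column types are assumptions.
DECLARE @UserActivity TABLE (
    HospitalID   INT,
    HospitalName VARCHAR(50),
    Device       VARCHAR(10),
    [Date]       DATE,
    UserID       INT NULL
);

INSERT INTO @UserActivity (HospitalID, HospitalName, Device, [Date], UserID)
VALUES
    (1, 'General', 'phone', '2022-01-01', 10),
    (1, 'General', 'phone', '2022-01-02', 10),
    (1, 'General', 'phone', '2022-01-03', 11),
    (1, 'General', 'phone', '2022-01-04', NULL),  -- gap-filler row, no user
    (1, 'General', 'phone', '2022-01-09', 12);

-- Self-join: A is the "current" row, B supplies every row in A's window.
SELECT
    A.HospitalID,
    A.HospitalName,
    A.Device,
    A.[Date],
    COUNT(DISTINCT B.UserID) AS ActiveUsers7Day  -- NULL UserIDs are not counted
FROM @UserActivity AS A
INNER JOIN @UserActivity AS B
    ON  B.HospitalID   = A.HospitalID
    AND B.HospitalName = A.HospitalName
    AND B.Device       = A.Device
    AND B.[Date] BETWEEN DATEADD(day, -6, A.[Date]) AND A.[Date]
GROUP BY A.HospitalID, A.HospitalName, A.Device, A.[Date]
ORDER BY A.[Date];
```

With these rows the counts should come out as 1, 1, 2, 2, 2 for the five dates: the NULL row on 2022-01-04 adds nothing, and by 2022-01-09 user 10 has dropped out of the window while user 12 has entered it. Because the window is defined on the date values and not on row positions, it does not depend on the dates being consecutive.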

Edit: For those who are trying to do this and getting stuck, a common mistake (one that I made myself when I did this by hand) is to count the wrong column: the aggregate must be COUNT(DISTINCT B.col), not COUNT(DISTINCT A.col). When I used Rasgo to generate the SQL to check myself, I caught my mistake. Hopefully this note helps someone in the future who makes this same mistake!
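The difference between the two aliases can be seen side by side in a small T-SQL sketch. The table variable @T and its three rows are hypothetical, chosen only so that the two counts diverge.

```sql
-- Hypothetical data: one new user per day on a single device.
DECLARE @T TABLE (Device VARCHAR(10), [Date] DATE, UserID INT);

INSERT INTO @T (Device, [Date], UserID)
VALUES
    ('phone', '2022-01-01', 10),
    ('phone', '2022-01-02', 11),
    ('phone', '2022-01-03', 12);

SELECT
    A.Device,
    A.[Date],
    COUNT(DISTINCT A.UserID) AS WrongCount,    -- only sees users on A's own date
    COUNT(DISTINCT B.UserID) AS RollingCount   -- sees every user in the window
FROM @T AS A
INNER JOIN @T AS B
    ON  B.Device = A.Device
    AND B.[Date] BETWEEN DATEADD(day, -6, A.[Date]) AND A.[Date]
GROUP BY A.Device, A.[Date]
ORDER BY A.[Date];
```

Here WrongCount should stay at 1 on every date, while RollingCount should grow 1, 2, 3, because only the B side of the join carries the rows from the preceding days.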
