Rolling COUNT DISTINCT of n-day active users using T-SQL
I am counting 7-day active users using T-SQL. I used the following code:
SELECT
*,
COUNT(DISTINCT [UserID]) OVER (
PARTITION BY [HospitalID], [HospitalName], [Device]
ORDER BY [Date]
ROWS 7 PRECEDING
) AS [7-Day Active Users]
FROM UserActivity
ORDER BY [HospitalID], [HospitalName], [Device], [Date]
I was told: Use of DISTINCT is not allowed with the OVER clause.

UserActivity is a table with columns HospitalID, HospitalName, Device (either phone or tablet), Date and UserID (could be NULL). To make things easier, I have filled the gaps between dates so that Date is consecutive and I can use ROWS 7 PRECEDING with confidence.

I did a lot of searching online and found that most solutions either use other SQL dialects (which is not possible in my case) or use the DENSE_RANK function, which does not support a moving window. What is the correct, and hopefully simple and concise, way of solving my problem?
Sample Data: https://docs.google.com/spreadsheets/d/19vrBK8ixpiPJycRjb1ekiKnEUYk5AaUH/edit?usp=sharing&ouid=110206477774349430845&rtpof=true&sd=true
Sorry to see that COUNT DISTINCT was not supported in that type of SQL... I hadn't known that. Especially after you went to the trouble of fixing the gaps between dates!
I used Rasgo to generate the SQL, so this won't necessarily work directly in your version (I tested it with Snowflake), but I think it will work as long as you fix the DATEADD function. Every RDBMS seems to do DATEADD differently.
The general concept here is to join the data to itself using a range join condition in the WHERE clause.
Luckily, this should work for you without having to fix the gaps in the dates first.
WITH BASIC_OFFSET_7DAY AS (
SELECT
A.HOSPITALNAME,
A.HOSPITALID,
A.DEVICE,
A.DATE,
COUNT(DISTINCT B.USERID) as COUNT_DISTINCT_USERID_PAST7DAY,
COUNT(1) AS AGG_ROW_COUNT
FROM
UserActivity A
INNER JOIN UserActivity B ON A.HOSPITALNAME = B.HOSPITALNAME
AND A.HOSPITALID = B.HOSPITALID
AND A.DEVICE = B.DEVICE
WHERE
B.DATE >= DATEADD(day, -7, A.DATE)
AND B.DATE <= A.DATE
GROUP BY
A.HOSPITALNAME,
A.HOSPITALID,
A.DEVICE,
A.DATE
)
SELECT
src.*,
BASIC_OFFSET_7DAY.COUNT_DISTINCT_USERID_PAST7DAY
FROM
UserActivity src
LEFT OUTER JOIN BASIC_OFFSET_7DAY ON BASIC_OFFSET_7DAY.DATE = src.DATE
AND BASIC_OFFSET_7DAY.HOSPITALNAME = src.HOSPITALNAME
AND BASIC_OFFSET_7DAY.HOSPITALID = src.HOSPITALID
AND BASIC_OFFSET_7DAY.DEVICE = src.DEVICE
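For SQL Server specifically, the same range-join idea can also be sketched with OUTER APPLY, which computes the count per source row and so avoids the second join back to the table. This is a minimal, untested sketch rather than a definitive solution. It assumes the column names from the question and that "7-day" means the current date plus the six preceding days (the query above, with >= DATEADD(day, -7, ...), spans eight calendar dates, as does ROWS 7 PRECEDING). DATEADD(day, n, date) is already valid T-SQL syntax.

```sql
-- Sketch only: column names are taken from the question; the window
-- definition (current date plus the 6 preceding days) is an assumption.
SELECT
    src.*,
    w.ActiveUsers AS [7-Day Active Users]
FROM UserActivity AS src
OUTER APPLY (
    -- COUNT(DISTINCT ...) ignores NULL UserID values.
    SELECT COUNT(DISTINCT b.[UserID]) AS ActiveUsers
    FROM UserActivity AS b
    WHERE b.[HospitalID]   = src.[HospitalID]
      AND b.[HospitalName] = src.[HospitalName]
      AND b.[Device]       = src.[Device]
      -- Strictly greater than "7 days ago" keeps exactly 7 calendar dates.
      AND b.[Date] >  DATEADD(day, -7, src.[Date])
      AND b.[Date] <= src.[Date]
) AS w
ORDER BY src.[HospitalID], src.[HospitalName], src.[Device], src.[Date];
```

Because the window is defined by date arithmetic rather than by row position, this form does not depend on the dates being gap-free.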
Let me know how that works out, and if it doesn't work I'll help you out.
Edit: For those who are trying to do this and getting stuck, a common mistake (one that I myself made when I did this by hand) is to count A.col instead of B.col; pay careful attention to use COUNT(DISTINCT(B.col)). When I used Rasgo to generate the SQL to check myself, I caught my mistake. Hopefully this note helps someone in the future who makes this same mistake!
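To see the difference between the two columns, here is a small self-contained T-SQL sketch. The table variable @Activity and its three rows are made up for illustration and are not from the sample data; the 7-calendar-day window is likewise an assumption.

```sql
-- Hypothetical demo data: one row per date, two distinct users overall.
DECLARE @Activity TABLE ([Date] date, UserID int);

INSERT INTO @Activity ([Date], UserID)
VALUES ('2024-01-01', 1),
       ('2024-01-02', 2),
       ('2024-01-03', 1);

SELECT
    A.[Date],
    -- Correct: B holds every row inside the trailing window.
    COUNT(DISTINCT B.UserID) AS WindowUsers,
    -- Mistake: A only holds the anchor date's own rows.
    COUNT(DISTINCT A.UserID) AS AnchorUsers
FROM @Activity AS A
INNER JOIN @Activity AS B
    ON  B.[Date] >  DATEADD(day, -7, A.[Date])
    AND B.[Date] <= A.[Date]
GROUP BY A.[Date]
ORDER BY A.[Date];
-- WindowUsers comes out as 1, 2, 2 while AnchorUsers stays at 1, 1, 1.
```

Counting on the A side silently returns the per-day figure, which looks plausible and is easy to miss.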