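How do I calculate Survival Rate in SQL?
(The dialect can be Vertica, Impala, or Databricks.)
I am trying to calculate the day-0, day-1, ... through day-7 survival rate of users. I treat all users who logged in on a given date as d0 (regardless of whether they are new or returning users) and look at how many of them come back on d1, d2, and so on. Suppose we have the following data: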
user | login_date
-----------------
001 | 2019-11-01
002 | 2019-11-01
003 | 2019-11-01
004 | 2019-11-01
005 | 2019-11-01
001 | 2019-11-02
003 | 2019-11-02
004 | 2019-11-02
006 | 2019-11-02
007 | 2019-11-02
002 | 2019-11-03
003 | 2019-11-03
004 | 2019-11-03
005 | 2019-11-03
008 | 2019-11-03
001 | 2019-11-04
002 | 2019-11-04
006 | 2019-11-04
007 | 2019-11-04
009 | 2019-11-04
I would like to see something like this:
date       | d0 | d1 | d2 | d3
------------------------------
2019-11-01 |  5 |  3 |  4 |  2
2019-11-02 |  5 |  2 |  3 |
2019-11-03 |  5 |  1 |    |
2019-11-04 |  5 |    |    |
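So you can see that d0 is 5 (even though some of the users had logged in before). For example, on 2019-11-02 we have 001, 003, 004, 006, 007, and 2 of them came back the next day.
I have written a query that comes close to my goal, but is not quite the same: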
WITH cte1 AS (
    SELECT
        user,
        login_date,
        FIRST_VALUE(login_date) OVER (PARTITION BY user ORDER BY login_date) AS first_login_day,
        DATEDIFF(login_date, first_login_day) AS days_since_first_play
    FROM
        table
)
SELECT
    first_login_day,
    SUM(CASE WHEN days_since_first_play = 0 THEN 1 ELSE 0 END) AS d0,
    SUM(CASE WHEN days_since_first_play = 1 THEN 1 ELSE 0 END) AS d1,
    SUM(CASE WHEN days_since_first_play = 2 THEN 1 ELSE 0 END) AS d2,
    SUM(CASE WHEN days_since_first_play = 3 THEN 1 ELSE 0 END) AS d3,
    SUM(CASE WHEN days_since_first_play = 4 THEN 1 ELSE 0 END) AS d4,
    SUM(CASE WHEN days_since_first_play = 5 THEN 1 ELSE 0 END) AS d5,
    SUM(CASE WHEN days_since_first_play = 6 THEN 1 ELSE 0 END) AS d6,
    SUM(CASE WHEN days_since_first_play = 7 THEN 1 ELSE 0 END) AS d7
FROM
    cte1
GROUP BY
    first_login_day
ORDER BY
    first_login_day
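The problem with this query is that it drops older players from the date I am looking at. For example, with the same data, because 001, 003, 004 already logged in on 2019-11-01, the d0 value for 2019-11-02 would be 2 instead of 5. So this query only works when I am looking at new users only.
Is it possible to change the query to get what I want? Thanks in advance.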
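This is an admittedly ugly way to do it. The idea is to flag each user_id for whether they are a day-1 returner, a day-2 returner, and so on, and then aggregate by login_date. I would love to see a better way to do this.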
with offsets as (
    select a.user_id
         , a.login_date
         , case when b.login_date is not null then 1 else 0 end day_plus_one
         , case when c.login_date is not null then 1 else 0 end day_plus_two
         , case when d.login_date is not null then 1 else 0 end day_plus_three
    from table a
    left join table b
        on b.user_id = a.user_id
        and b.login_date = a.login_date+1
    left join table c
        on c.user_id = a.user_id
        and c.login_date = a.login_date+2
    left join table d
        on d.user_id = a.user_id
        and d.login_date = a.login_date+3
    order by a.user_id, a.login_date
)
select
    login_date
    , count(distinct user_id) day_zero_logins
    , sum(day_plus_one) day_one_logins
    , sum(day_plus_two) day_two_logins
    , sum(day_plus_three) day_three_logins
from offsets
group by login_date
order by login_date
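One possible alternative to one join per day is a single self-join over a date range followed by conditional aggregation. The sketch below is not from the thread; it runs the idea against the question's sample data in an in-memory SQLite database from Python, purely so it can be checked locally. The table name `logins`, the column `user_id`, and the function name are hypothetical, and the SQLite date functions (`date(..., '+N day')`, `julianday`) are stand-ins: Vertica, Impala, and Databricks each have their own date arithmetic.

```python
import sqlite3

# Sample data from the question, keyed by login date.
LOGINS = {
    "2019-11-01": ["001", "002", "003", "004", "005"],
    "2019-11-02": ["001", "003", "004", "006", "007"],
    "2019-11-03": ["002", "003", "004", "005", "008"],
    "2019-11-04": ["001", "002", "006", "007", "009"],
}


def survival_counts_range_join(logins, max_offset=3):
    """Return (login_date, d0, ..., d<max_offset>) rows using one range self-join."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table and column names, chosen for this sketch only.
    conn.execute("CREATE TABLE logins (user_id TEXT, login_date TEXT)")
    conn.executemany(
        "INSERT INTO logins VALUES (?, ?)",
        [(user, day) for day, users in logins.items() for user in users],
    )
    # One COUNT(DISTINCT ...) column per day offset, d0 through d<max_offset>.
    day_cols = ", ".join(
        f"COUNT(DISTINCT CASE WHEN day_offset = {k} THEN user_id END) AS d{k}"
        for k in range(max_offset + 1)
    )
    sql = f"""
        SELECT login_date, {day_cols}
        FROM (
            -- Pair every login with the same user's logins in the next N days.
            SELECT a.login_date,
                   b.user_id,
                   CAST(ROUND(julianday(b.login_date) - julianday(a.login_date))
                        AS INTEGER) AS day_offset
            FROM logins a
            JOIN logins b
              ON b.user_id = a.user_id
             AND b.login_date BETWEEN a.login_date
                                  AND date(a.login_date, '+{max_offset} day')
        ) AS pairs
        GROUP BY login_date
        ORDER BY login_date
    """
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


for row in survival_counts_range_join(LOGINS):
    print(row)
```

On the sample data this gives the counts the question asks for, with 0 where the question's table is blank. Extending to d7 only means changing `max_offset`, rather than adding four more joins.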
A few self left joins and distinct user counts will give you this result.
SELECT t0.login_date,
    COUNT(distinct t0.user) as d0,
    COUNT(distinct t1.user) as d1,
    COUNT(distinct t2.user) as d2,
    COUNT(distinct t3.user) as d3
FROM table t0
LEFT JOIN table t1
    ON t1.user = t0.user
    AND t1.login_date = t0.login_date + 1
LEFT JOIN table t2
    ON t2.user = t0.user
    AND t2.login_date = t0.login_date + 2
LEFT JOIN table t3
    ON t3.user = t0.user
    AND t3.login_date = t0.login_date + 3
GROUP BY t0.login_date
ORDER BY t0.login_date
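To check the self-join query against the question's sample data, here is a minimal sketch that runs the same query shape in an in-memory SQLite database from Python. It makes a few assumptions that are not in the original: the table is called `logins` and the column `user_id` (both hypothetical, since `table` and `user` are reserved in many dialects), and `t0.login_date + k` is written as SQLite's `date(t0.login_date, '+k day')`.

```python
import sqlite3

# Sample data from the question, keyed by login date.
LOGINS = {
    "2019-11-01": ["001", "002", "003", "004", "005"],
    "2019-11-02": ["001", "003", "004", "006", "007"],
    "2019-11-03": ["002", "003", "004", "005", "008"],
    "2019-11-04": ["001", "002", "006", "007", "009"],
}


def survival_counts(logins, max_offset=3):
    """Return (login_date, d0, ..., d<max_offset>) rows via LEFT self-joins."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table and column names, chosen for this sketch only.
    conn.execute("CREATE TABLE logins (user_id TEXT, login_date TEXT)")
    conn.executemany(
        "INSERT INTO logins VALUES (?, ?)",
        [(user, day) for day, users in logins.items() for user in users],
    )
    selects = ["COUNT(DISTINCT t0.user_id) AS d0"]
    joins = []
    for k in range(1, max_offset + 1):
        selects.append(f"COUNT(DISTINCT t{k}.user_id) AS d{k}")
        # Every tk is matched against t0: same user, exactly k days later.
        joins.append(
            f"LEFT JOIN logins t{k} ON t{k}.user_id = t0.user_id "
            f"AND t{k}.login_date = date(t0.login_date, '+{k} day')"
        )
    sql = (
        "SELECT t0.login_date, " + ", ".join(selects)
        + " FROM logins t0 " + " ".join(joins)
        + " GROUP BY t0.login_date ORDER BY t0.login_date"
    )
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


for row in survival_counts(LOGINS):
    print(row)
```

On the sample data the rows match the table in the question, except that offsets with no data yet come back as 0 rather than blank, because `COUNT(DISTINCT ...)` over only NULLs is 0.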
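But what if the login dates need to be connected, that is, each day must follow a login on the previous day?
Then just change the JOIN criteria to: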
FROM table t0
LEFT JOIN table t1
    ON t1.user = t0.user
    AND t1.login_date = t0.login_date + 1
LEFT JOIN table t2
    ON t2.user = t1.user
    AND t2.login_date = t1.login_date + 1
LEFT JOIN table t3
    ON t3.user = t2.user
    AND t3.login_date = t2.login_date + 1
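The chained variant answers a different question: d2 now counts only users who logged in on day 0, day 1, and day 2 in a row. A minimal sketch of the difference on the question's sample data, again using an in-memory SQLite database from Python as a stand-in for the real dialect. The table name `logins`, the column `user_id`, and the function name are hypothetical, and `date(..., '+1 day')` is SQLite's spelling of `login_date + 1`.

```python
import sqlite3

# Sample data from the question, keyed by login date.
LOGINS = {
    "2019-11-01": ["001", "002", "003", "004", "005"],
    "2019-11-02": ["001", "003", "004", "006", "007"],
    "2019-11-03": ["002", "003", "004", "005", "008"],
    "2019-11-04": ["001", "002", "006", "007", "009"],
}

# Each join hangs off the previous alias, so a gap on any day breaks the chain.
CHAINED_SQL = """
    SELECT t0.login_date,
           COUNT(DISTINCT t0.user_id) AS d0,
           COUNT(DISTINCT t1.user_id) AS d1,
           COUNT(DISTINCT t2.user_id) AS d2,
           COUNT(DISTINCT t3.user_id) AS d3
    FROM logins t0
    LEFT JOIN logins t1
           ON t1.user_id = t0.user_id
          AND t1.login_date = date(t0.login_date, '+1 day')
    LEFT JOIN logins t2
           ON t2.user_id = t1.user_id
          AND t2.login_date = date(t1.login_date, '+1 day')
    LEFT JOIN logins t3
           ON t3.user_id = t2.user_id
          AND t3.login_date = date(t2.login_date, '+1 day')
    GROUP BY t0.login_date
    ORDER BY t0.login_date
"""


def consecutive_day_counts(logins):
    """Return (login_date, d0, d1, d2, d3) rows counting unbroken login streaks."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table and column names, chosen for this sketch only.
    conn.execute("CREATE TABLE logins (user_id TEXT, login_date TEXT)")
    conn.executemany(
        "INSERT INTO logins VALUES (?, ?)",
        [(user, day) for day, users in logins.items() for user in users],
    )
    rows = conn.execute(CHAINED_SQL).fetchall()
    conn.close()
    return rows


for row in consecutive_day_counts(LOGINS):
    print(row)
```

For 2019-11-01 the chained version yields d2 = 2 (only 003 and 004 logged in on all three days) and d3 = 0, whereas the unchained joins give d2 = 4 and d3 = 2, so the two variants should not be mixed up when reporting survival.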