I am trying to write a query that gets the cumulative user count over the course of a month.
WITH USERS_PER_DAY AS (
SELECT
    DATE_TRUNC('day', HOUR_DIM.UTC) AS DAY
    , APP_NAME
    , COUNT(DISTINCT CLIENT_SID) AS ACTIVE_USER_COUNT
FROM RPT.S_HOURLY_INACTIVE_TVS_AGG
WHERE DATEDIFF('month', HOUR_DIM.UTC, CURRENT_DATE) BETWEEN 0 AND 0
GROUP BY
    DATE_TRUNC('day', HOUR_DIM.UTC)
    , APP_NAME
)
SELECT
    DAY
    , APP_NAME
    , SUM(ACTIVE_USER_COUNT) OVER (PARTITION BY APP_NAME ORDER BY DAY ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS CUMULATIVE_ACTIVE_USER_COUNT
FROM USERS_PER_DAY
The problem is that I need a count of distinct or unique users for the month, but this query contains duplication in users between days. I know that I can't use a count(distinct ...) in my window function but is there another way to ensure that I don't have duplication in users between days?
So a naive solution is to reduce the data to distinct days and distinct users per day, and then join those two CTEs to get the result:
WITH data AS (
select
hour_dim_utc::timestamp_ntz as hour_dim_utc
,user_id
from values
('2020-03-10 9:50', 1 ),
('2020-03-10 9:51', 3 ),
('2020-03-10 10:51', 3 ),
('2020-03-11 9:52', 1 ),
('2020-03-11 9:53', 2 ),
('2020-03-11 9:54', 0 ),
('2020-03-12 9:55', 0 ),
('2020-03-12 9:56', 1 ),
('2020-03-12 9:57', 3 ),
('2020-03-14 9:58', 2 ),
('2020-03-15 9:59', 3 ),
('2020-03-16 10:00', 2 ),
('2020-03-17 10:01', 2 ),
('2020-03-18 10:02', 0 ),
('2020-03-19 10:04', 11 )
s( hour_dim_utc, user_id)
), distinct_users_days AS (
select distinct
hour_dim_utc::date as day
,user_id
from data
), distinct_days AS (
select distinct
hour_dim_utc::date as day
from data
)
select
a.day
,count(distinct(u.user_id)) as acum_count
from distinct_days as a
join distinct_users_days as u on u.day <= a.day
group by 1 order by 1;
gives:
DAY ACUM_COUNT
2020-03-10 2
2020-03-11 4
2020-03-12 4
2020-03-14 4
2020-03-15 4
2020-03-16 4
2020-03-17 4
2020-03-18 4
2020-03-19 5
In your SQL you do WHERE DATEDIFF('month', HOUR_DIM.UTC, CURRENT_DATE) BETWEEN 0 AND 0. It would be more readable, and more performant since the raw column is compared against a constant, to say WHERE hour_dim.utc >= DATE_TRUNC('month', CURRENT_DATE)
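Applied to the original query, that filter looks like this (a sketch using the table and column names from the question):

```sql
SELECT
    DATE_TRUNC('day', HOUR_DIM.UTC) AS DAY
    , COUNT(DISTINCT CLIENT_SID) AS ACTIVE_USER_COUNT
FROM RPT.S_HOURLY_INACTIVE_TVS_AGG
-- comparing the raw column to a constant lets Snowflake prune
-- micro-partitions instead of evaluating DATEDIFF on every row
WHERE HOUR_DIM.UTC >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY 1;
```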
The "clever" approach to this is to count each user only on the first day they appear, then take a running sum of those per-day new-user counts:
SELECT first_day, APP_NAME,
SUM(COUNT(*)) OVER (PARTITION BY APP_NAME ORDER BY first_day ASC) as CUMULATIVE_ACTIVE_USER_COUNT
FROM (SELECT CLIENT_SID, APP_NAME,
MIN(DATE_TRUNC('day', HOUR_DIM.UTC)) as first_day
FROM RPT.S_HOURLY_INACTIVE_TVS_AGG
WHERE DATEDIFF('month', HOUR_DIM.UTC, CURRENT_DATE) BETWEEN 0 AND 0
GROUP BY CLIENT_SID, APP_NAME
) cs
GROUP BY first_day, APP_NAME;
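The same idea can be written without the MIN() subquery by flagging each user's first row with ROW_NUMBER() and running a sum over those flags. This is a sketch against Simon's sample data CTE; as a side effect it keeps every day that has any activity, not just days that are someone's first day:

```sql
SELECT
    day
    , SUM(SUM(is_first)) OVER (ORDER BY day ASC) AS acum
FROM (
    SELECT
        hour_dim_utc::date AS day
        -- 1 only on the earliest event for each user, 0 on every later event
        , IFF(ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY hour_dim_utc) = 1, 1, 0) AS is_first
    FROM data
)
GROUP BY day;
```

Against the sample data this yields one row per active day (2020-03-10 through 2020-03-19 except the 13th), with the running count 2, 4, 4, 4, 4, 4, 4, 4, 5.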
Gordon's updated answer works when the data is dense enough that every day of the month is some user's first day, but when the data is sparse, like my example data, the gap days are simply missing from the output. Gordon's code is effectively this:
WITH data AS (
select hour_dim_utc::timestamp_ntz as hour_dim_utc, user_id from values
('2020-03-10 9:50', 1 ),
('2020-03-10 9:51', 3 ),
('2020-03-10 10:51', 3 ),
('2020-03-11 9:52', 1 ),
('2020-03-11 9:53', 2 ),
('2020-03-11 9:54', 0 ),
('2020-03-12 9:55', 0 ),
('2020-03-12 9:56', 1 ),
('2020-03-12 9:57', 3 ),
('2020-03-14 9:58', 2 ),
('2020-03-15 9:59', 3 ),
('2020-03-16 10:00', 2 ),
('2020-03-17 10:01', 2 ),
('2020-03-18 10:02', 0 ),
('2020-03-19 10:04', 11 )
s( hour_dim_utc, user_id)
)
select
first_day
,sum(count(*)) over (ORDER BY first_day ASC) as acum
from (
select user_id
,min(hour_dim_utc::date) as first_day
from data
group by 1
) group by 1;
which gives:
FIRST_DAY ACUM
2020-03-10 2
2020-03-11 4
2020-03-19 5
I know this is old, but hopefully, this will help anyone looking for something similar.
If you look at the last post, there is no row for March 13th. As Simon mentioned, the data is sparse. To get one row for every day, create a date spine. Building on the SQL from the last post, I joined in a calendar table with one row per day (its date column is DATE_KEY in the example below). Since such tables tend to extend far into the past and future, I queried the initial dataset for its min() and max() dates to limit the rows returned from the date table.
I left the first_day field in the query but commented out, so you can uncomment it to see how the date spine lines up with the dates returned from your dataset.
WITH
dates AS (
SELECT DATE_KEY
FROM my_date_table
)
,data AS (
select hour_dim_utc::timestamp_ntz as hour_dim_utc, user_id from values
('2020-03-10 9:50', 1 ),
('2020-03-10 9:51', 3 ),
('2020-03-10 10:51', 3 ),
('2020-03-11 9:52', 1 ),
('2020-03-11 9:53', 2 ),
('2020-03-11 9:54', 0 ),
('2020-03-12 9:55', 0 ),
('2020-03-12 9:56', 1 ),
('2020-03-12 9:57', 3 ),
('2020-03-14 9:58', 2 ),
('2020-03-15 9:59', 3 ),
('2020-03-16 10:00', 2 ),
('2020-03-17 10:01', 2 ),
('2020-03-18 10:02', 0 ),
('2020-03-19 10:04', 11 )
s( hour_dim_utc, user_id)
)
,RANGES as (
SELECT
min(hour_dim_utc::date) AS min_day
,max(hour_dim_utc::date) AS max_day
FROM data
)
, first_days AS (
select
first_day
,sum(count(*)) over (ORDER BY first_day ASC) as acum
from (
select user_id
,min(hour_dim_utc::date) as first_day
from data
group by 1
) group by 1
)
SELECT
D.DATE_KEY
-- ,FD.FIRST_DAY
,max(FD.ACUM) over (ORDER BY D.DATE_KEY ASC) AS ACUM -- running max: FD.ACUM is already cumulative, so carry the latest value forward across gap days
FROM DATES D
inner join ranges ON d.date_key >= ranges.min_day and d.date_key <= ranges.max_day
LEFT JOIN FIRST_DAYS FD ON D.DATE_KEY = FD.FIRST_DAY
which results in
+------------+------+
| DATE_KEY   | ACUM |
+------------+------+
| 2020-03-10 |    2 |
| 2020-03-11 |    4 |
| 2020-03-12 |    4 |
| 2020-03-13 |    4 |
| 2020-03-14 |    4 |
| 2020-03-15 |    4 |
| 2020-03-16 |    4 |
| 2020-03-17 |    4 |
| 2020-03-18 |    4 |
| 2020-03-19 |    5 |
+------------+------+
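If you don't have a calendar table handy, Snowflake can generate the date spine inline with GENERATOR. A sketch; the start date and row count are hard-coded for the sample data, so derive them from the RANGES CTE for real use:

```sql
WITH dates AS (
    SELECT
        -- SEQ4() can contain gaps, so ROW_NUMBER() over it is the safe way
        -- to get a dense 0..N-1 sequence of day offsets
        DATEADD('day', ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1, DATE '2020-03-10')::date AS date_key
    FROM TABLE(GENERATOR(ROWCOUNT => 10))  -- 10 consecutive days starting 2020-03-10
)
SELECT date_key FROM dates ORDER BY date_key;
```

This dates CTE can then replace the SELECT from my_date_table in the query above.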