
Snowflake - Getting a Count of Distinct Users While Using a Window Frame or ORDER BY

I am trying to write a query that gets the cumulative user count over the course of a month.

WITH USERS_PER_DAY AS (
  SELECT 
    DATE_TRUNC('day', HOUR_DIM.UTC) AS DAY
  , APP_NAME
  , COUNT(DISTINCT CLIENT_SID) AS ACTIVE_USER_COUNT
  FROM RPT.S_HOURLY_INACTIVE_TVS_AGG HOUR_DIM
  WHERE DATEDIFF('month', HOUR_DIM.UTC, CURRENT_DATE) BETWEEN 0 AND 0
  GROUP BY 
    DATE_TRUNC('day', HOUR_DIM.UTC)
  , APP_NAME
)
SELECT
  DAY
, APP_NAME
, SUM(ACTIVE_USER_COUNT) OVER (
    PARTITION BY APP_NAME
    ORDER BY DAY ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS CUMULATIVE_ACTIVE_USER_COUNT
FROM USERS_PER_DAY

The output now looks like this:

(screenshot of the query output omitted)

The problem is that I need a count of distinct (unique) users for the month, but this query double-counts users who appear on more than one day. I know that I can't use COUNT(DISTINCT ...) in my window function, but is there another way to make sure the same user isn't counted on multiple days?
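
What's being asked for can be illustrated outside SQL. Below is a minimal Python sketch (illustrative only, with a made-up subset of the sample events) of a running distinct-user count: a set carries the users seen so far in the month, which is exactly the state a windowed SUM cannot maintain.

```python
# Illustrative sketch: cumulative count of *distinct* users per day.
# A running set of seen users is what COUNT(DISTINCT ...) OVER (...) would
# need, and why a plain windowed SUM over daily counts double-counts.
from itertools import groupby

# hypothetical (day, user_id) events
events = [
    ("2020-03-10", 1), ("2020-03-10", 3),
    ("2020-03-11", 1), ("2020-03-11", 2), ("2020-03-11", 0),
    ("2020-03-12", 0), ("2020-03-12", 1), ("2020-03-12", 3),
    ("2020-03-19", 11),
]

seen = set()       # users observed so far this month
cumulative = []    # (day, distinct users to date)
for day, rows in groupby(sorted(events), key=lambda e: e[0]):
    seen.update(user for _, user in rows)
    cumulative.append((day, len(seen)))
```

On day 2020-03-12 all three users were already seen, so the cumulative count stays at 4 instead of jumping to 7 the way a windowed SUM of daily distinct counts would.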

So a naive solution is to reduce the data to distinct days and distinct users per day, and then join those two CTEs to get the result:

WITH data AS (  
    select 
        hour_dim_utc::timestamp_ntz as hour_dim_utc
        ,user_id 
    from values
        ('2020-03-10 9:50', 1 ),
        ('2020-03-10 9:51', 3 ),
        ('2020-03-10 10:51', 3 ),
        ('2020-03-11 9:52', 1 ),
        ('2020-03-11 9:53', 2 ),
        ('2020-03-11 9:54', 0 ),
        ('2020-03-12 9:55', 0 ),
        ('2020-03-12 9:56', 1 ),
        ('2020-03-12 9:57', 3 ),
        ('2020-03-14 9:58', 2 ),
        ('2020-03-15 9:59', 3 ),
        ('2020-03-16 10:00', 2 ),
        ('2020-03-17 10:01', 2 ),
        ('2020-03-18 10:02', 0 ),
        ('2020-03-19 10:04', 11 )
         s( hour_dim_utc, user_id)
), distinct_users_days AS (
    select distinct 
        hour_dim_utc::date as day
        ,user_id
    from data
), distinct_days AS (
    select distinct 
        hour_dim_utc::date as day
    from data
)
select 
    a.day
    ,count(distinct(u.user_id)) as acum_count
from distinct_days as a
join distinct_users_days as u on u.day <= a.day
group by 1 order by 1;

gives:

DAY         ACUM_COUNT
2020-03-10  2
2020-03-11  4
2020-03-12  4
2020-03-14  4
2020-03-15  4
2020-03-16  4
2020-03-17  4
2020-03-18  4
2020-03-19  5
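
The same join logic can be sketched in Python (illustrative only, using the sample data above): for each distinct day, gather every (day, user) pair on or before that day and count the distinct users, mirroring the u.day <= a.day join. Like the SQL, it is quadratic in the number of days.

```python
# Illustrative mirror of the distinct_days / distinct_users_days join:
# count distinct users over all pairs whose day is <= each target day.
pairs = {
    ("2020-03-10", 1), ("2020-03-10", 3),
    ("2020-03-11", 1), ("2020-03-11", 2), ("2020-03-11", 0),
    ("2020-03-12", 0), ("2020-03-12", 1), ("2020-03-12", 3),
    ("2020-03-14", 2), ("2020-03-15", 3), ("2020-03-16", 2),
    ("2020-03-17", 2), ("2020-03-18", 0), ("2020-03-19", 11),
}

days = sorted({d for d, _ in pairs})
# ISO date strings compare correctly as plain strings
acum = {d: len({u for d2, u in pairs if d2 <= d}) for d in days}
```

The resulting `acum` reproduces the ACUM_COUNT column shown above.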

In your SQL you filter with WHERE DATEDIFF('month', HOUR_DIM.UTC, CURRENT_DATE) BETWEEN 0 AND 0; it would be more readable, and more performant (the filter applies directly to the column instead of a function of it), to write WHERE HOUR_DIM.UTC >= DATE_TRUNC('month', CURRENT_DATE).

The "clever" approach to this is to use the sum of dense_rank() s:

SELECT first_day, APP_NAME,
       SUM(COUNT(*)) OVER (PARTITION BY APP_NAME ORDER BY first_day ASC) as CUMULATIVE_ACTIVE_USER_COUNT
FROM (SELECT CLIENT_SID, APP_NAME,
             MIN(DATE_TRUNC('day', HOUR_DIM.UTC)) as first_day
      FROM RPT.S_HOURLY_INACTIVE_TVS_AGG
      WHERE DATEDIFF('month', HOUR_DIM.UTC, CURRENT_DATE) BETWEEN 0 AND 0
      GROUP BY CLIENT_SID, APP_NAME
     ) cs
GROUP BY first_day, APP_NAME;
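
The first-seen-day trick above can be sketched in Python (illustrative only, same sample data): take MIN(day) per user, count users grouped by that first day, then take a running sum. Note that only days that gained a first-time user appear in the output, which is the sparsity issue discussed below.

```python
# Illustrative mirror of the first-seen-day trick: a user contributes to
# the count only on the first day they appear, so the running SUM of those
# per-day counts equals the cumulative distinct-user count.
from collections import Counter
from itertools import accumulate

events = [
    ("2020-03-10", 1), ("2020-03-10", 3), ("2020-03-10", 3),
    ("2020-03-11", 1), ("2020-03-11", 2), ("2020-03-11", 0),
    ("2020-03-12", 0), ("2020-03-12", 1), ("2020-03-12", 3),
    ("2020-03-14", 2), ("2020-03-15", 3), ("2020-03-16", 2),
    ("2020-03-17", 2), ("2020-03-18", 0), ("2020-03-19", 11),
]

first_day = {}
for day, user in sorted(events):        # MIN(day) per user
    first_day.setdefault(user, day)

per_day = Counter(first_day.values())   # COUNT(*) grouped by first day
days = sorted(per_day)
acum = dict(zip(days, accumulate(per_day[d] for d in days)))
```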

Gordon's updated answer works well if the data is dense enough that every day of the month has at least one user whose first appearance is that day; but when the data is sparse, like my example data, you don't get a row for every day, so the results are not what you expect.

Gordon's code is effectively this:

WITH data AS (  
select hour_dim_utc::timestamp_ntz as hour_dim_utc, user_id from values
    ('2020-03-10 9:50', 1 ),
    ('2020-03-10 9:51', 3 ),
    ('2020-03-10 10:51', 3 ),
    ('2020-03-11 9:52', 1 ),
    ('2020-03-11 9:53', 2 ),
    ('2020-03-11 9:54', 0 ),
    ('2020-03-12 9:55', 0 ),
    ('2020-03-12 9:56', 1 ),
    ('2020-03-12 9:57', 3 ),
    ('2020-03-14 9:58', 2 ),
    ('2020-03-15 9:59', 3 ),
    ('2020-03-16 10:00', 2 ),
    ('2020-03-17 10:01', 2 ),
    ('2020-03-18 10:02', 0 ),
    ('2020-03-19 10:04', 11 )
     s( hour_dim_utc, user_id)
)
select 
    first_day
    ,sum(count(*)) over (ORDER BY first_day ASC) as acum 
from (
    select user_id
        ,min(hour_dim_utc::date) as first_day
    from data 
    group by 1
) group by 1;

which gives:

FIRST_DAY   ACUM
2020-03-10  2
2020-03-11  4
2020-03-19  5

I know this is old, but hopefully, this will help anyone looking for something similar.

If you look at the last post from the OP, there is no March 13th. As Simon mentioned, the data is sparse. To have one entry for every day, create a date spine. Using the SQL from the last post, I joined to a calendar table that has one row per day (its date column is DATE_KEY in the example below). Since those tables tend to extend far into the past and future, I queried the initial dataset for min() and max() values to limit the rows returned from the date table.

I left the first_day field in the query, commented out, so you can uncomment it to see how the date spine lines up with the dates returned from your dataset.

WITH 
dates AS (
SELECT DATE_KEY
FROM my_date_table
)

,data AS (  
select hour_dim_utc::timestamp_ntz as hour_dim_utc, user_id from values
    ('2020-03-10 9:50', 1 ),
    ('2020-03-10 9:51', 3 ),
    ('2020-03-10 10:51', 3 ),
    ('2020-03-11 9:52', 1 ),
    ('2020-03-11 9:53', 2 ),
    ('2020-03-11 9:54', 0 ),
    ('2020-03-12 9:55', 0 ),
    ('2020-03-12 9:56', 1 ),
    ('2020-03-12 9:57', 3 ),
    ('2020-03-14 9:58', 2 ),
    ('2020-03-15 9:59', 3 ),
    ('2020-03-16 10:00', 2 ),
    ('2020-03-17 10:01', 2 ),
    ('2020-03-18 10:02', 0 ),
    ('2020-03-19 10:04', 11 )
     s( hour_dim_utc, user_id)
)
,RANGES as (
    SELECT
    min(hour_dim_utc::date) AS min_day
    ,max(hour_dim_utc::date) AS max_day
    FROM data

)
, first_days AS (
select 
    first_day
    ,sum(count(*)) over (ORDER BY first_day ASC) as acum 
from (
    select user_id
        ,min(hour_dim_utc::date) as first_day
    from data 
    group by 1
) group by 1
)

SELECT 
    D.DATE_KEY
    -- ,FD.FIRST_DAY
    ,sum(FD.ACUM) over (ORDER BY DATE_KEY ASC) AS ACUM
FROM DATES D
inner join ranges ON d.date_key >= ranges.min_day and d.date_key <= ranges.max_day
LEFT JOIN FIRST_DAYS FD ON D.DATE_KEY = FD.FIRST_DAY
ORDER BY D.DATE_KEY

which results in

+------------+------+
|  DATE_KEY  | ACUM |
+------------+------+
| 2020-03-10 |    2 |
| 2020-03-11 |    6 |
| 2020-03-12 |    6 |
| 2020-03-13 |    6 |
| 2020-03-14 |    6 |
| 2020-03-15 |    6 |
| 2020-03-16 |    6 |
| 2020-03-17 |    6 |
| 2020-03-18 |    6 |
| 2020-03-19 |   11 |
+------------+------+
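
The spine idea can be sketched in Python (illustrative only; note one deliberate difference: instead of SUM-ing the joined ACUM values over the spine, this sketch carries the last known cumulative value forward across gap days, so each gap day keeps the previous day's total):

```python
# Illustrative date spine: generate every calendar day between the min and
# max event dates, left-join the sparse cumulative counts, and forward-fill
# gap days with the last known value.
from datetime import date, timedelta

# hypothetical sparse cumulative counts keyed by first_day (first_days CTE)
sparse = {date(2020, 3, 10): 2, date(2020, 3, 11): 4, date(2020, 3, 19): 5}

lo, hi = min(sparse), max(sparse)                 # the RANGES CTE
spine = [lo + timedelta(days=n) for n in range((hi - lo).days + 1)]

filled, last = {}, 0
for day in spine:                                 # LEFT JOIN + forward fill
    last = sparse.get(day, last)
    filled[day] = last
```

With the sample data this yields one row per calendar day, March 13th included, each holding the distinct-user total as of that day.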
