
Distinct in Window Functions. BigQuery

I'm trying to do something like this in BigQuery: COUNT(DISTINCT user_id) OVER (PARTITION BY DATE_TRUNC(date, month), sample, app_id ORDER BY DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as ACTIVE_USERS

In other words, I have a table with Date, Userid, Sample and Application ID. I need to count the cumulative number of unique active users for each day starting from the beginning of the month and ending with the current day.

The function works without DISTINCT, but then it gives me a running total of user rows rather than unique users, which is not what I need.

I tried some tricks with DENSE_RANK, but they don't work here either.

Is there any way to calculate the number of distinct users using window functions?
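To make the requirement concrete, the following plain-Python sketch models the desired result on made-up rows. The sample data and the helper name month_to_date_counts are hypothetical and only illustrate the semantics; they are not taken from the real tables. For each day it reports the running row count (what COUNT without DISTINCT returns) next to the running number of unique users within the month (the value being asked for).

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (date, user_id, sample, app_id).
rows = [
    (date(2024, 1, 1), "u1", "A", "app"),
    (date(2024, 1, 1), "u2", "A", "app"),
    (date(2024, 1, 2), "u1", "A", "app"),
    (date(2024, 1, 3), "u3", "A", "app"),
    (date(2024, 2, 1), "u1", "A", "app"),
]


def month_to_date_counts(rows):
    """Return {(date, sample, app_id): (running row count, running distinct users)}."""
    seen = defaultdict(set)   # users seen so far per (year, month, sample, app_id)
    total = defaultdict(int)  # rows seen so far per (year, month, sample, app_id)
    result = {}
    for day, user_id, sample, app_id in sorted(rows):
        part = (day.year, day.month, sample, app_id)
        seen[part].add(user_id)
        total[part] += 1
        # Later rows for the same day overwrite earlier ones, so each day
        # ends up with the totals up to and including that day.
        result[(day, sample, app_id)] = (total[part], len(seen[part]))
    return result


counts = month_to_date_counts(rows)
for key in sorted(counts):
    print(key[0].isoformat(), counts[key])
```

On 2024-01-02 the running row count is 3 but only 2 distinct users have been active, which is exactly the gap between the two window expressions. The counters restart on 2024-02-01 because the month is part of the partition.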

UPDATED: here is the full query, so you can better understand what I need:

    with mtd1 as (select  
'MonthToDate' as TIMELINE
,fd.date DATE
,td.SAMPLE as SAMPLE
,td.APPNAME as APP_ID 
,sum(fd.revenue) as REVENUE 
,td.user_id ACTIVE_USERS 
from DWH.DailyUser fd 
join DWH.Depositors td using (userid)
group by 1,2,3,4,6
),
mtd as (
select TIMELINE
,DATE
,SAMPLE
,APP_ID
,sum(revenue) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as REVENUE
,COUNT(distinct active_users) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as ACTIVE_USERS 
from mtd1
)
select * from mtd 
where extract(day from date) = extract(day from current_date)
group by 1,2,3,4,5,6 

You can use ARRAY_AGG as the window function, then count the distinct elements in each resulting array. Note, though, that the query can run out of memory if the arrays grow too large.

with mtd1 as (select  
'MonthToDate' as TIMELINE
,fd.date DATE
,td.SAMPLE as SAMPLE
,td.APPNAME as APP_ID 
,sum(fd.revenue) as REVENUE 
,td.user_id ACTIVE_USERS 
from DWH.DailyUser fd 
join DWH.Depositors td using (userid)
group by 1,2,3,4,6
),
-- named mtd2 because a CTE name cannot be defined twice in one WITH clause
mtd2 as (
select TIMELINE
,DATE
,SAMPLE
,APP_ID
,sum(revenue) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as REVENUE
,ARRAY_AGG(active_users) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as ACTIVE_USERS 
from mtd1
), mtd AS (
  SELECT * EXCEPT(ACTIVE_USERS),
    (SELECT COUNT(DISTINCT u) FROM UNNEST(ACTIVE_USERS) AS u) AS ACTIVE_USERS
   FROM mtd2
)
)
select * from mtd 
where extract(day from date) = extract(day from current_date)
group by 1,2,3,4,5,6
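The memory caveat is easier to see with a small model. The plain-Python sketch below mimics what a running ARRAY_AGG window followed by COUNT(DISTINCT ...) over UNNEST computes for a single partition. The rows and the helper name array_agg_running are hypothetical, and this models the semantics only, not how BigQuery actually executes the query.

```python
from datetime import date

# Hypothetical rows of one (month, sample, app_id) partition: (date, user_id).
rows = [
    (date(2024, 1, 1), "u1"),
    (date(2024, 1, 1), "u2"),
    (date(2024, 1, 2), "u1"),
    (date(2024, 1, 3), "u3"),
    (date(2024, 1, 3), "u1"),
]


def array_agg_running(rows):
    """Model ARRAY_AGG(user_id) OVER (ORDER BY date RANGE UNBOUNDED PRECEDING .. CURRENT ROW)."""
    framed = []
    for day, _ in rows:
        # A RANGE frame includes every row whose date is <= the current
        # row's date, so rows sharing a date see the same array.
        window = [user for d, user in rows if d <= day]
        framed.append((day, window))
    return framed


framed = array_agg_running(rows)

# Equivalent of (SELECT COUNT(DISTINCT u) FROM UNNEST(ACTIVE_USERS) AS u).
distinct_counts = [(day, len(set(window))) for day, window in framed]

# Total number of array elements built across all rows.
materialized = sum(len(window) for _, window in framed)

print(distinct_counts)
print("array elements materialized:", materialized)
```

Five input rows already produce 17 array elements here, because every row carries a copy of everything before it in its partition. That growth is roughly quadratic in the partition size, which is why large months or busy apps can exhaust memory.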

"Are there any ways to calculate the number of distinct users using window functions?"

This specific question is a duplicate and has already been answered here.

"... here is the full query ..."

As for how to apply the above to your particular query, see below (not tested, and based entirely on your code):

#standardSQL
WITH mtd1 AS (
  SELECT  
    'MonthToDate' AS TIMELINE
    ,fd.date DATE
    ,td.SAMPLE AS SAMPLE
    ,td.APPNAME AS APP_ID 
    ,SUM(fd.revenue) AS REVENUE 
    ,td.user_id ACTIVE_USERS 
  FROM `DWH.DailyUser` fd 
  JOIN `DWH.Depositors` td USING (userid)
  GROUP BY 1,2,3,4,6
), mtd2 AS (
  SELECT 
    TIMELINE
    ,DATE
    ,SAMPLE
    ,APP_ID
    ,SUM(REVENUE) OVER (PARTITION BY DATE_TRUNC(DATE, MONTH), SAMPLE, APP_ID ORDER BY DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS REVENUE
    ,ARRAY_AGG(ACTIVE_USERS) OVER (PARTITION BY DATE_TRUNC(DATE, MONTH), SAMPLE, APP_ID ORDER BY DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ACTIVE_USERS 
  FROM mtd1
), mtd AS (
  SELECT * REPLACE((SELECT COUNT(DISTINCT u) FROM UNNEST(ACTIVE_USERS) AS u) AS ACTIVE_USERS)
  FROM mtd2
)
SELECT * FROM mtd 
WHERE EXTRACT(day FROM DATE) = EXTRACT(day FROM CURRENT_DATE)
GROUP BY 1,2,3,4,5,6

One method for implementing count(distinct) uses row_number() and then counts the "1"s:

select SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) OVER (PARTITION BY DATE_TRUNC(date, month), sample, app_id ORDER BY date) as Active_Users
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY DATE_TRUNC(date, month), sample, app_id, user_id ORDER BY DATE) as seqnum
      FROM t
     ) t
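The ROW_NUMBER approach can be tried locally. The sketch below runs the same query shape through Python's built-in sqlite3 module, which is only a stand-in for BigQuery: strftime('%Y-%m', date) replaces DATE_TRUNC(date, month), the table t and its rows are hypothetical, and SELECT DISTINCT is an addition that collapses the output to one row per day. It needs an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date TEXT, user_id TEXT, sample TEXT, app_id TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01", "u1", "A", "app"),
        ("2024-01-01", "u2", "A", "app"),
        ("2024-01-02", "u1", "A", "app"),
        ("2024-01-03", "u3", "A", "app"),
        ("2024-02-01", "u1", "A", "app"),
    ],
)

query = """
SELECT DISTINCT date, sample, app_id,
       -- Running sum of "first appearance this month" flags.
       SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) OVER (
           PARTITION BY strftime('%Y-%m', date), sample, app_id
           ORDER BY date
       ) AS active_users
FROM (
    SELECT t.*,
           -- seqnum = 1 marks a user's first row within the month.
           ROW_NUMBER() OVER (
               PARTITION BY strftime('%Y-%m', date), sample, app_id, user_id
               ORDER BY date
           ) AS seqnum
    FROM t
) AS flagged
ORDER BY date
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Each user contributes exactly one flagged row per month, so the running sum of the flags equals the cumulative distinct count without building any arrays. Because the window has an ORDER BY and no explicit frame, rows sharing a date are peers and receive the same total.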


 