简体   繁体   中英

How to run SUM() OVER PARTITION BY for COUNT DISTINCT

I'm trying to get the number of distinct users for each event at a daily level while maintainig a running sum for every hour. I'm using Athena/Presto as the query engine.

I tried the following query:

SELECT
    eventname,
    date(from_unixtime(time_bucket)) AS date,
    (time_bucket % 86400)/3600 as hour,
    count,
    SUM(count) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum_count
FROM (
    SELECT 
        eventname,
        CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
        COUNT(DISTINCT moengageuserid) as count
    FROM clickstream.moengage
    WHERE date = '2020-08-20'
    AND eventname IN ('e1', 'e2', 'e3', 'e4')
    GROUP BY 1,2
    ORDER BY 1,2
);

But on seeing the results I realized that taking SUM of COUNT DISTINCT is not correct as it's not additive.

So, I tried the below query

SELECT
    eventname,
    date(from_unixtime(time_bucket)) AS date,
    (time_bucket % 86400)/3600 as hour,
    SUM(COUNT(DISTINCT moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
FROM (
    SELECT
        eventname,
        CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
        moengageuserid
    FROM clickstream.moengage
    WHERE date = '2020-08-20'
    AND eventname IN ('e1', 'e2', 'e3', 'e4')
);

But this query fails with the following error:

SYNTAX_ERROR: line 5:99: ORDER BY expression '"time_bucket"' must be an aggregate expression or appear in GROUP BY clause

To calculate running distinct count you can collect user IDs into set (distinct array) and get the size:

cardinality(set_agg(moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum

This is analytic function and will assign the same value to the whole partition (eventname, date), you can aggregate records in upper subquery using max(), etc.

Count the first time a user appears for the running distinct count:

SELECT eventname, date(from_unixtime(time_bucket)) AS date,
       (time_bucket % 86400)/3600 as hour,
       COUNT(DISTINCT moengageuserid) as hour_cont,
       SUM(CASE WHEN seqnunm = 1 THEN 1 ELSE 0 END) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY time_bucket) AS running_distinct_count
FROM (SELECT eventname,
             CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
             moengageuserid as hour_count,
             ROW_NUMBER() OVER (PARTITION BY eventname, moengageuserid ORDER BY eventtimestamp) as seqnum
      FROM clickstream.moengage
      WHERE date = '2020-08-20' AND
            eventname IN ('e1', 'e2', 'e3', 'e4')
    ) m
GROUP BY 1, 2, 3
ORDER BY 1, 2;

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM