I'm trying to get the number of distinct users for each event at a daily level while maintainig a running sum for every hour. I'm using Athena/Presto as the query engine.
I tried the following query:
SELECT
eventname,
date(from_unixtime(time_bucket)) AS date,
(time_bucket % 86400)/3600 as hour,
count,
SUM(count) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum_count
FROM (
SELECT
eventname,
CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
COUNT(DISTINCT moengageuserid) as count
FROM clickstream.moengage
WHERE date = '2020-08-20'
AND eventname IN ('e1', 'e2', 'e3', 'e4')
GROUP BY 1,2
ORDER BY 1,2
);
But on seeing the results I realized that taking SUM of COUNT DISTINCT is not correct as it's not additive.
So, I tried the below query
SELECT
eventname,
date(from_unixtime(time_bucket)) AS date,
(time_bucket % 86400)/3600 as hour,
SUM(COUNT(DISTINCT moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
FROM (
SELECT
eventname,
CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
moengageuserid
FROM clickstream.moengage
WHERE date = '2020-08-20'
AND eventname IN ('e1', 'e2', 'e3', 'e4')
);
But this query fails with the following error:
SYNTAX_ERROR: line 5:99: ORDER BY expression '"time_bucket"' must be an aggregate expression or appear in GROUP BY clause
To calculate running distinct count you can collect user IDs into set (distinct array) and get the size:
cardinality(set_agg(moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
This is analytic function and will assign the same value to the whole partition (eventname, date), you can aggregate records in upper subquery using max(), etc.
Count the first time a user appears for the running distinct count:
SELECT eventname, date(from_unixtime(time_bucket)) AS date,
(time_bucket % 86400)/3600 as hour,
COUNT(DISTINCT moengageuserid) as hour_cont,
SUM(CASE WHEN seqnunm = 1 THEN 1 ELSE 0 END) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY time_bucket) AS running_distinct_count
FROM (SELECT eventname,
CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
moengageuserid as hour_count,
ROW_NUMBER() OVER (PARTITION BY eventname, moengageuserid ORDER BY eventtimestamp) as seqnum
FROM clickstream.moengage
WHERE date = '2020-08-20' AND
eventname IN ('e1', 'e2', 'e3', 'e4')
) m
GROUP BY 1, 2, 3
ORDER BY 1, 2;
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.