How to run SUM() OVER PARTITION BY for COUNT DISTINCT
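I am trying to get the number of distinct users for each event at a daily level, while maintaining a running sum at the hourly level. I am using Athena/Presto as the query engine.
I tried the following query: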
SELECT
eventname,
date(from_unixtime(time_bucket)) AS date,
(time_bucket % 86400)/3600 as hour,
count,
SUM(count) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum_count
FROM (
SELECT
eventname,
CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
COUNT(DISTINCT moengageuserid) as count
FROM clickstream.moengage
WHERE date = '2020-08-20'
AND eventname IN ('e1', 'e2', 'e3', 'e4')
GROUP BY 1,2
ORDER BY 1,2
);
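But after looking at the results, I realized that taking the SUM of COUNT DISTINCT is not correct, because distinct counts are not additive: a user who is active in several hours is counted once in each of them.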
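To make the non-additivity concrete, here is a minimal sketch on hypothetical inline data. The `events` rows and the `hourly` CTE are invented for illustration and are not part of the real `clickstream.moengage` table.

```sql
-- Hypothetical sample: u1 is active in hours 0 and 1, u2 only in hour 1.
WITH events (moengageuserid, hour) AS (
    VALUES ('u1', 0), ('u1', 1), ('u2', 1)
),
hourly AS (
    -- Distinct users per hour: hour 0 -> 1, hour 1 -> 2
    SELECT hour, COUNT(DISTINCT moengageuserid) AS hourly_distinct
    FROM events
    GROUP BY hour
)
SELECT
    -- Adds the hourly counts, so u1 is counted twice
    (SELECT SUM(hourly_distinct) FROM hourly) AS sum_of_hourly_counts,
    -- Counts each user once across the whole day
    (SELECT COUNT(DISTINCT moengageuserid) FROM events) AS true_distinct_users;
```

On this sample the summed hourly counts come to 3, while only 2 distinct users exist.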
So I tried the following query:
SELECT
eventname,
date(from_unixtime(time_bucket)) AS date,
(time_bucket % 86400)/3600 as hour,
SUM(COUNT(DISTINCT moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
FROM (
SELECT
eventname,
CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
moengageuserid
FROM clickstream.moengage
WHERE date = '2020-08-20'
AND eventname IN ('e1', 'e2', 'e3', 'e4')
);
But this query fails with the following error:
SYNTAX_ERROR: line 5:99: ORDER BY expression '"time_bucket"' must be an aggregate expression or appear in GROUP BY clause
To calculate a running distinct count, you can collect the user IDs into a set (an array of distinct values) and take its size:
cardinality(set_agg(moengageuserid) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY time_bucket)) AS running_distinct_count
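This is an analytic (window) function, so it returns a value for every input row rather than one row per group. With the ORDER BY in the window, all rows in the same hour of a partition (eventname, date) receive the same value, so you can collapse them to one row per hour in an outer query using max() or a similar aggregate.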
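As a rough sketch, the full query for this approach might look like the following. It assumes the engine provides `set_agg` and allows it as a window function. If it does not, `array_distinct(array_agg(moengageuserid) OVER (...))` is one possible substitute. The column names `running_distinct` and `running_distinct_count` are chosen here for illustration.

```sql
-- Sketch: running distinct users per (eventname, day), one output row per hour.
SELECT
    eventname,
    date(from_unixtime(time_bucket)) AS date,
    (time_bucket % 86400) / 3600 AS hour,
    -- Every row in the same hour carries the same value, so max() just collapses them
    max(running_distinct) AS running_distinct_count
FROM (
    SELECT
        eventname,
        time_bucket,
        -- OVER attaches to the aggregate; cardinality() wraps the windowed result
        cardinality(set_agg(moengageuserid) OVER (
            PARTITION BY eventname, date(from_unixtime(time_bucket))
            ORDER BY time_bucket
        )) AS running_distinct
    FROM (
        SELECT
            eventname,
            CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
            moengageuserid
        FROM clickstream.moengage
        WHERE date = '2020-08-20'
          AND eventname IN ('e1', 'e2', 'e3', 'e4')
    )
)
GROUP BY eventname, time_bucket
ORDER BY eventname, time_bucket;
```

Note that this builds an array of user IDs for every row, which may be memory-intensive on large partitions.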
Alternatively, compute the running distinct count by counting each user only the first time they appear:
SELECT eventname, date(from_unixtime(time_bucket)) AS date,
(time_bucket % 86400)/3600 as hour,
COUNT(DISTINCT moengageuserid) as hour_count,
-- Inner SUM: users first seen in this hour; outer windowed SUM: running total over the day
SUM(SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY time_bucket) AS running_distinct_count
FROM (SELECT eventname,
CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
moengageuserid,
-- seqnum = 1 marks a user's first row for this event within the filtered day
ROW_NUMBER() OVER (PARTITION BY eventname, moengageuserid ORDER BY eventtimestamp) as seqnum
FROM clickstream.moengage
WHERE date = '2020-08-20' AND
eventname IN ('e1', 'e2', 'e3', 'e4')
) m
-- Group by time_bucket itself so the window's ORDER BY can reference it
GROUP BY eventname, time_bucket
ORDER BY eventname, time_bucket;
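The first-appearance idea can be checked on a small inline data set. The sketch below uses hypothetical rows in an `events` CTE (not the real table) with `time_bucket` already rounded down to the hour. The inner SUM counts users first seen in each hour and the windowed outer SUM accumulates those counts.

```sql
-- Hypothetical sample: u1 appears in hours 0 and 1, u2 in hours 1 and 2, u3 in hour 2.
WITH events (eventname, moengageuserid, time_bucket) AS (
    VALUES
        ('e1', 'u1', 0),
        ('e1', 'u1', 3600),
        ('e1', 'u2', 3600),
        ('e1', 'u3', 7200),
        ('e1', 'u2', 7200)
),
ranked AS (
    SELECT eventname, moengageuserid, time_bucket,
           -- seqnum = 1 marks the first hour in which each user appears
           ROW_NUMBER() OVER (PARTITION BY eventname, moengageuserid
                              ORDER BY time_bucket) AS seqnum
    FROM events
)
SELECT eventname,
       time_bucket / 3600 AS hour,
       COUNT(DISTINCT moengageuserid) AS hour_count,
       SUM(SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END))
           OVER (PARTITION BY eventname ORDER BY time_bucket) AS running_distinct_count
FROM ranked
GROUP BY eventname, time_bucket
ORDER BY eventname, time_bucket;
```

On this sample the hourly distinct counts are 1, 2, 2, while the running distinct count is 1, 2, 3, because u1 and u2 are each counted only in the hour they first appear.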