How to run SUM() OVER PARTITION BY for COUNT DISTINCT

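I am trying to get the number of distinct users for each event at the daily level, while keeping a running sum by hour. I am using Athena/Presto as the query engine.

I tried the following query: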

SELECT
    eventname,
    date(from_unixtime(time_bucket)) AS date,
    (time_bucket % 86400)/3600 as hour,
    count,
    SUM(count) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum_count
FROM (
    SELECT 
        eventname,
        CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
        COUNT(DISTINCT moengageuserid) as count
    FROM clickstream.moengage
    WHERE date = '2020-08-20'
    AND eventname IN ('e1', 'e2', 'e3', 'e4')
    GROUP BY 1,2
    ORDER BY 1,2
);

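But after looking at the results, I realized that taking a SUM of COUNT DISTINCT is incorrect, because distinct counts are not additive: a user who is active in several hours is counted once in each of them.

So I tried the following query: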
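A minimal Python sketch of the problem, using made-up data (the hours and user IDs below are hypothetical, not taken from the real table): summing hourly distinct counts overstates the daily distinct count whenever a user shows up in more than one hour.

```python
# Hypothetical events for a single event name: (hour, user_id)
events = [(0, "u1"), (0, "u2"), (1, "u1"), (1, "u3"), (2, "u1")]

# Distinct users per hour, like the inner query's COUNT(DISTINCT ...)
users_by_hour = {}
for hour, user in events:
    users_by_hour.setdefault(hour, set()).add(user)
hourly_distinct = {h: len(u) for h, u in sorted(users_by_hour.items())}

# Summing the hourly distinct counts counts u1 once per hour it appears in
sum_of_hourly = sum(hourly_distinct.values())

# Number of distinct users over the whole day
true_distinct = len({user for _, user in events})

print(hourly_distinct, sum_of_hourly, true_distinct)
```

Here the hourly counts add up to 5, while only 3 distinct users exist.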

SELECT
    eventname,
    date(from_unixtime(time_bucket)) AS date,
    (time_bucket % 86400)/3600 as hour,
    SUM(COUNT(DISTINCT moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
FROM (
    SELECT
        eventname,
        CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
        moengageuserid
    FROM clickstream.moengage
    WHERE date = '2020-08-20'
    AND eventname IN ('e1', 'e2', 'e3', 'e4')
);

But this query fails with the following error:

SYNTAX_ERROR: line 5:99: ORDER BY expression '"time_bucket"' must be an aggregate expression or appear in GROUP BY clause

To calculate a running distinct count, you can collect the user IDs into a set (a distinct array) and take its size:

cardinality(set_agg(moengageuserid) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket)) AS running_sum

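This is an analytic function, so it returns a value for every input row rather than one row per hour; rows in the same partition (eventname, date) that share a time_bucket get the same value. You can collapse them in an outer query using max() or a similar aggregate.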
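The idea can be sketched outside SQL as follows. This is only an illustration in Python with hypothetical rows, not the Presto implementation: for each event name, keep one growing set of user IDs and report its size after each hour.

```python
from itertools import groupby

# Hypothetical rows standing in for the subquery output: (eventname, hour, user_id)
rows = [
    ("e1", 0, "u1"), ("e1", 0, "u2"),
    ("e1", 1, "u1"), ("e1", 1, "u3"),
    ("e2", 0, "u1"), ("e2", 2, "u1"),
]

def running_distinct(rows):
    """Return {(eventname, hour): distinct users seen up to and including that hour}."""
    result = {}
    for event, event_rows in groupby(sorted(rows), key=lambda r: r[0]):
        seen = set()  # plays the role of the set collected over the window
        for hour, hour_rows in groupby(event_rows, key=lambda r: r[1]):
            seen.update(user for _, _, user in hour_rows)
            result[(event, hour)] = len(seen)  # cardinality of the set so far
    return result

print(running_distinct(rows))
```

Each (eventname, hour) pair ends up with a single value, which is what the max() in the outer query would pick out of the repeated per-row values.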

Calculate the running distinct count by counting each user only the first time they appear:

SELECT eventname,
       date(from_unixtime(time_bucket)) AS date,
       (time_bucket % 86400)/3600 AS hour,
       COUNT(DISTINCT moengageuserid) AS hour_count,
       -- inner SUM: users first seen in this hour; outer windowed SUM: running total
       SUM(SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY time_bucket) AS running_distinct_count
FROM (SELECT eventname,
             CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
             moengageuserid,
             -- seqnum = 1 marks a user's first occurrence for this event
             ROW_NUMBER() OVER (PARTITION BY eventname, moengageuserid ORDER BY eventtimestamp) AS seqnum
      FROM clickstream.moengage
      WHERE date = '2020-08-20' AND
            eventname IN ('e1', 'e2', 'e3', 'e4')
     ) m
-- group by time_bucket itself so the window's ORDER BY refers to a grouped column
GROUP BY eventname, time_bucket
ORDER BY eventname, time_bucket;
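The same logic as a small Python sketch, with hypothetical rows and a hypothetical helper name: flag the first time each (eventname, user) pair is seen, count the flags per hourly bucket, then take a running sum per event name.

```python
# Hypothetical rows: (eventname, event_timestamp_seconds, user_id)
rows = [
    ("e1", 10, "u1"), ("e1", 20, "u2"),
    ("e1", 3700, "u1"), ("e1", 3800, "u3"),
    ("e1", 7300, "u2"),
]

def running_distinct_by_first_seen(rows):
    first_seen = set()   # (eventname, user) pairs already counted; mirrors seqnum = 1
    new_per_bucket = {}  # (eventname, hour) -> users appearing for the first time
    for event, ts, user in sorted(rows, key=lambda r: r[1]):
        bucket = (event, ts // 3600)
        new_per_bucket.setdefault(bucket, 0)  # keep hours with no new users
        if (event, user) not in first_seen:
            first_seen.add((event, user))
            new_per_bucket[bucket] += 1
    # Running sum of first appearances within each event name
    running, totals = {}, {}
    for (event, hour), new_users in sorted(new_per_bucket.items()):
        totals[event] = totals.get(event, 0) + new_users
        running[(event, hour)] = totals[event]
    return running

print(running_distinct_by_first_seen(rows))
```

In hour 2 only a returning user appears, so the running count stays at 3 instead of increasing.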
