How to run SUM() OVER PARTITION BY for COUNT DISTINCT

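I'm trying to get the number of distinct users for each event at a daily level while maintaining a running total for every hour. I'm using Athena/Presto as the query engine.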

I tried the following query:

SELECT
    eventname,
    date(from_unixtime(time_bucket)) AS date,
    (time_bucket % 86400)/3600 as hour,
    count,
    SUM(count) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum_count
FROM (
    SELECT 
        eventname,
        CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
        COUNT(DISTINCT moengageuserid) as count
    FROM clickstream.moengage
    WHERE date = '2020-08-20'
    AND eventname IN ('e1', 'e2', 'e3', 'e4')
    GROUP BY 1,2
    ORDER BY 1,2
);

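But after seeing the results, I realized that taking a SUM of COUNT DISTINCT is incorrect, because distinct counts are not additive: a user who is active in several hours is counted once in each of them.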
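To make the non-additivity concrete, here is a small Python sketch. The hours and user IDs are made up for illustration and do not come from the clickstream table. It compares the running sum of hourly distinct counts with the true running distinct count.

```python
from itertools import accumulate

# Hypothetical events for one eventname on one day: (hour, user_id)
events = [(0, "u1"), (0, "u2"), (1, "u1"), (1, "u3"), (2, "u2")]

hours = sorted({h for h, _ in events})

# COUNT(DISTINCT user) per hour, as the inner query computes it
hourly_distinct = [len({u for h, u in events if h == hour}) for hour in hours]

# SUM(count) OVER (ORDER BY hour): running sum of the hourly distinct counts
running_sum = list(accumulate(hourly_distinct))

# Desired result: distinct users seen up to and including each hour
running_distinct = [len({u for h, u in events if h <= hour}) for hour in hours]

print(hourly_distinct)   # [2, 2, 1]
print(running_sum)       # [2, 4, 5]
print(running_distinct)  # [2, 3, 3]
```

The running sum reaches 5 even though only 3 different users exist, because u1 and u2 each appear in two hours.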

So I tried the following query:

SELECT
    eventname,
    date(from_unixtime(time_bucket)) AS date,
    (time_bucket % 86400)/3600 as hour,
    SUM(COUNT(DISTINCT moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
FROM (
    SELECT
        eventname,
        CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
        moengageuserid
    FROM clickstream.moengage
    WHERE date = '2020-08-20'
    AND eventname IN ('e1', 'e2', 'e3', 'e4')
);

But this query fails with the following error:

SYNTAX_ERROR: line 5:99: ORDER BY expression '"time_bucket"' must be an aggregate expression or appear in GROUP BY clause

To calculate a running distinct count, you can collect the user IDs into a set (an array of distinct values) and take its size:

cardinality(set_agg(moengageuserid) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket)) AS running_sum

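This is an analytic (window) function, so it returns a value on every input row instead of one row per group; within a partition (eventname, date), all rows in the same hour bucket get the same value. To get one row per hour, aggregate the records in an outer query, for example with max().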

Calculate the running distinct count by counting each user only at their first appearance:

SELECT eventname,
       date(from_unixtime(time_bucket)) AS date,
       (time_bucket % 86400)/3600 AS hour,
       COUNT(DISTINCT moengageuserid) AS hour_count,
       -- Inner SUM counts users first seen in this hour; the outer SUM ... OVER accumulates it across hours
       SUM(SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY time_bucket) AS running_distinct_count
FROM (SELECT eventname,
             CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
             moengageuserid,
             -- seqnum = 1 marks the first row for each (eventname, user) pair
             ROW_NUMBER() OVER (PARTITION BY eventname, moengageuserid ORDER BY eventtimestamp) AS seqnum
      FROM clickstream.moengage
      WHERE date = '2020-08-20' AND
            eventname IN ('e1', 'e2', 'e3', 'e4')
     ) m
-- Group by time_bucket itself so the window's ORDER BY can reference it
GROUP BY eventname, time_bucket
ORDER BY eventname, time_bucket;
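The first-appearance idea can be checked outside the database. The Python sketch below mirrors the query's logic for a single event name and day. The function name `running_distinct_by_first_seen` and the sample rows are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import accumulate

def running_distinct_by_first_seen(events):
    """events: iterable of (unix_timestamp, user_id) for one event name and day.
    Returns [(hour_bucket, running_distinct_count)] ordered by hour."""
    events = sorted(events)  # process rows in timestamp order
    first_seen_hour = {}
    for ts, user in events:
        # Equivalent of seqnum = 1: remember only each user's earliest hour
        first_seen_hour.setdefault(user, ts // 3600)
    # Every hour that has at least one event gets an output row (GROUP BY)
    hours = sorted({ts // 3600 for ts, _ in events})
    # Users first seen per hour: SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)
    new_users = Counter(first_seen_hour.values())
    # Accumulating across hours plays the role of the outer SUM(...) OVER (ORDER BY ...)
    totals = accumulate(new_users[h] for h in hours)
    return list(zip(hours, totals))

sample = [(10, "u1"), (20, "u2"), (3700, "u1"), (3800, "u3"), (7300, "u2")]
result = running_distinct_by_first_seen(sample)
print(result)  # [(0, 2), (1, 3), (2, 3)]
```

Because each user contributes exactly once, at the hour of their first event, the per-hour counts of new users are additive, which is what makes the running sum valid here.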

