
Hive rolling sum of data over date

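I am working in Hive and am stuck on a rolling count. The sample data I am working with looks like this:

[image: sample data]

The output I expect looks like this:

[image: expected output]

I tried the following query, but it does not return a rolling count: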

select event_dt, status, count(distinct account)
from (select *,
             row_number() over (partition by account order by event_dt desc) as rnum
      from table.A
      where event_dt between '2018-05-02' and '2018-05-04'
     ) x
where rnum = 1
group by event_dt, status;

If anyone has solved a similar problem, please help.

You seem to just want conditional aggregation:

select event_dt,
       sum(case when status = 'Registered' then 1 else 0 end) as registered,
       sum(case when status = 'active_acct' then 1 else 0 end) as active_acct,
       sum(case when status = 'suspended' then 1 else 0 end) as suspended,
       sum(case when status = 'reactive' then 1 else 0 end) as reactive
from table.A 
group by event_dt
order by event_dt;
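
To make the behavior of that query concrete, here is a minimal Python sketch of the same conditional aggregation. The rows are hypothetical (the original sample data was posted as an image and is not available), and `events_per_day` is an illustrative name, not anything from Hive. Note that it counts only the status events that occurred on each date; it does not carry an account's status forward to later dates, which is the gap the edited answer addresses.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows: (event_dt, account, status).
# These values are made up for illustration only.
rows = [
    ("2018-05-02", "acct1", "Registered"),
    ("2018-05-02", "acct2", "Registered"),
    ("2018-05-03", "acct1", "active_acct"),
    ("2018-05-04", "acct2", "suspended"),
]

# Status values taken from the query above.
STATUSES = ["Registered", "active_acct", "suspended", "reactive"]


def events_per_day(rows):
    """Count status events per date, one column per status.

    Mirrors sum(case when status = ... then 1 else 0 end) ... group by event_dt.
    """
    counts = defaultdict(Counter)
    for event_dt, _account, status in rows:
        counts[event_dt][status] += 1
    # Emit every status column for every date, defaulting to 0.
    return {dt: {s: counts[dt][s] for s in STATUSES} for dt in sorted(counts)}


daily = events_per_day(rows)
for dt, cols in daily.items():
    print(dt, cols)
```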

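EDIT:

This is a tricky problem. The solution I've come up with is to take the cross product of dates and accounts, and then calculate each account's most recent status as of each date.

So: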

select a.event_dt,
       sum(case when aa.status = 'Registered' then 1 else 0 end) as registered,
       sum(case when aa.status = 'active_acct' then 1 else 0 end) as active_acct,
       sum(case when aa.status = 'suspended' then 1 else 0 end) as suspended,
       sum(case when aa.status = 'reactive' then 1 else 0 end) as reactive
from (select d.event_dt, ac.account, a.status,
             max(case when a.status is not null then a.timestamp end) over (partition by ac.account order by d.event_dt) as last_status_timestamp
      from (select distinct event_dt from table.A) d cross join
           (select distinct account from table.A) ac left join
           (select a.*,
                   row_number() over (partition by account, event_dt order by timestamp desc) as seqnum
            from table.A a
           ) a
           on a.event_dt = d.event_dt and
              a.account = ac.account and
              a.seqnum = 1  -- get the last one on the date
     ) a left join
     table.A aa
     on aa.timestamp = a.last_status_timestamp and
        aa.account = a.account
group by a.event_dt
order by a.event_dt;

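This creates a derived table with a row for every account and every date. Some of those rows (but not all) have a status for that day.

The cumulative max that produces last_status_timestamp gives, for each account and date, the most recent timestamp that has a valid status. That timestamp is then joined back to the table to fetch the status as of that date. Voila! This is the status used for the conditional aggregation.

The cumulative max and join are a workaround because Hive does not (yet?) support the ignore nulls option on lag().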
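
As a way to check the logic outside Hive, the steps above can be sketched in plain Python. This is only an illustration under assumptions: the rows are hypothetical (the original sample data was an image and is not available), `rolling_status_counts` is a made-up name, and timestamps are assumed to be ISO-formatted strings so that string order matches time order. A running dictionary of each account's latest status plays the role of the cumulative max plus join-back.

```python
from collections import Counter

# Hypothetical rows: (event_dt, account, status, timestamp).
rows = [
    ("2018-05-02", "acct1", "Registered", "2018-05-02 09:00:00"),
    ("2018-05-02", "acct2", "Registered", "2018-05-02 10:00:00"),
    ("2018-05-03", "acct1", "active_acct", "2018-05-03 08:30:00"),
    ("2018-05-04", "acct2", "suspended", "2018-05-04 12:00:00"),
    ("2018-05-04", "acct3", "Registered", "2018-05-04 13:00:00"),
]

STATUSES = ["Registered", "active_acct", "suspended", "reactive"]


def rolling_status_counts(rows):
    """For each date, count accounts by their most recent status so far."""
    dates = sorted({r[0] for r in rows})
    accounts = sorted({r[1] for r in rows})

    # Last status per (account, date): the seqnum = 1 row in the query.
    last_on_day = {}
    for event_dt, account, status, _ts in sorted(rows, key=lambda r: r[3]):
        last_on_day[(account, event_dt)] = status  # later timestamps overwrite

    result = {}
    current = {}  # account -> latest known status, carried forward across dates
    for dt in dates:  # dates x accounts is the cross join
        for account in accounts:
            if (account, dt) in last_on_day:
                current[account] = last_on_day[(account, dt)]
        tally = Counter(current.values())
        result[dt] = {s: tally[s] for s in STATUSES}
    return result


rolling = rolling_status_counts(rows)
for dt, cols in rolling.items():
    print(dt, cols)
```

Accounts with no status yet on a given date are simply absent from `current`, which corresponds to the rows where last_status_timestamp is null and nothing is counted.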

