Hive rolling sum of data over date

I am working on Hive and am facing an issue with rolling counts. The sample data I am working on is as shown below:

[image: sample data]

and the output I am expecting is as shown below:

[image: expected output]

I tried using the following query but it is not returning the rolling count:

select event_dt, status, count(distinct account)
from (select *,
             row_number() over (partition by account order by event_dt desc) as rnum
      from table.A
      where event_dt between '2018-05-02' and '2018-05-04'
     ) x
where rnum = 1
group by event_dt, status;

Please help me with this if someone has solved a similar issue.

You seem to just want conditional aggregation:

select event_dt,
       sum(case when status = 'Registered' then 1 else 0 end) as registered,
       sum(case when status = 'active_acct' then 1 else 0 end) as active_acct,
       sum(case when status = 'suspended' then 1 else 0 end) as suspended,
       sum(case when status = 'reactive' then 1 else 0 end) as reactive
from table.A 
group by event_dt
order by event_dt;

EDIT:

This is a tricky problem. The solution I've come up with does a cross-product of dates and users and then calculates the most recent status as of each date.

So:

select a.event_dt,
       sum(case when aa.status = 'Registered' then 1 else 0 end) as registered,
       sum(case when aa.status = 'active_acct' then 1 else 0 end) as active_acct,
       sum(case when aa.status = 'suspended' then 1 else 0 end) as suspended,
       sum(case when aa.status = 'reactive' then 1 else 0 end) as reactive
from (-- one row per (date, account); status is null on dates with no event
      -- `timestamp` is backticked because it is a reserved word in recent Hive versions
      select d.event_dt, ac.account, a.status,
             max(case when a.status is not null then a.`timestamp` end) over (partition by ac.account order by d.event_dt) as last_status_timestamp
      from (select distinct event_dt from table.A) d cross join
           (select distinct account from table.A) ac left join
           (select a.*,
                   row_number() over (partition by account, event_dt order by `timestamp` desc) as seqnum
            from table.A a
           ) a
           on a.event_dt = d.event_dt and
              a.account = ac.account and
              a.seqnum = 1  -- get the last one on the date
     ) a left join
     table.A aa  -- join back to look up the status at that timestamp
     on aa.`timestamp` = a.last_status_timestamp and
        aa.account = a.account
group by a.event_dt  -- the outer alias is a; d is only visible inside the subquery
order by a.event_dt;

What this is doing is creating a derived table with one row for every combination of account and date. The status is filled in only on days when the account has an event; on all other days it is null.

The cumulative max for last_status_timestamp calculates the most recent timestamp that has a valid status. This is then joined back to the table to get the status on that date. Voila! This is the status used for the conditional aggregation.
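
To see the cumulative max in isolation, here is a minimal sketch on made-up data: a single hypothetical account with events on two of three dates. The CTE names (events, dates), the integer ts column standing in for the timestamp, and all the values are assumptions for illustration; none of them come from the original table.

```sql
-- Hypothetical data: one account with events on 2018-05-02 and 2018-05-04 only.
with events as (
  select '2018-05-02' as event_dt, 1 as ts, 'Registered' as status
  union all
  select '2018-05-04' as event_dt, 3 as ts, 'suspended' as status
),
dates as (
  select '2018-05-02' as event_dt
  union all
  select '2018-05-03' as event_dt
  union all
  select '2018-05-04' as event_dt
)
select d.event_dt,
       e.status,  -- null on 2018-05-03, where there is no event
       -- max() skips nulls, so the running max carries the last seen ts forward
       max(e.ts) over (order by d.event_dt) as last_status_ts
from dates d left join
     events e
     on e.event_dt = d.event_dt
order by d.event_dt;
```

On 2018-05-03 the status column is null, but last_status_ts is still 1. That carried-forward value is what lets the join back to the table recover 'Registered' for that date.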

The cumulative max and join are a work-around because Hive does not (yet?) support the ignore nulls option in lag().
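
A possible alternative, offered as an untested sketch rather than as part of the original answer: Hive's last_value() takes an optional second boolean argument that skips nulls when set to true. If the Hive version in use supports it, the status could be carried forward directly, without the join back to the table. The status_asof alias and the s and x aliases are names invented here.

```sql
-- Sketch only: assumes last_value(expr, true) is available and skips nulls.
select x.event_dt,
       sum(case when x.status_asof = 'Registered' then 1 else 0 end) as registered,
       sum(case when x.status_asof = 'active_acct' then 1 else 0 end) as active_acct,
       sum(case when x.status_asof = 'suspended' then 1 else 0 end) as suspended,
       sum(case when x.status_asof = 'reactive' then 1 else 0 end) as reactive
from (select d.event_dt, ac.account,
             -- most recent non-null status for the account up to this date
             last_value(s.status, true) over (partition by ac.account
                                              order by d.event_dt
                                              rows between unbounded preceding and current row) as status_asof
      from (select distinct event_dt from table.A) d cross join
           (select distinct account from table.A) ac left join
           (select a.*,
                   row_number() over (partition by account, event_dt order by `timestamp` desc) as seqnum
            from table.A a
           ) s
           on s.event_dt = d.event_dt and
              s.account = ac.account and
              s.seqnum = 1  -- keep only the last event on each date
     ) x
group by x.event_dt
order by x.event_dt;
```

Accounts with no event on or before a given date get a null status_asof and so are counted in none of the columns for that date, which matches the behavior of the join-based query.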
