Create Missing Data Hive SQL
I have a table of activity dates on which something changed on an account, for example:
2020-08-13 123 Upgrade
2020-08-17 123 Downgrade
2020-08-21 123 Upgrade
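Basically this relates to a single line: three activities took place on this account. They had a basic account, then downgraded, then upgraded again.
I want these filled in so that every day is present, such as: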
2020-08-13 123 Upgrade1
2020-08-14 123 Upgrade1
2020-08-15 123 Upgrade1
2020-08-16 123 Upgrade1
2020-08-17 123 Downgrade1
2020-08-18 123 Downgrade1
2020-08-19 123 Downgrade1
2020-08-20 123 Downgrade1
2020-08-21 123 Upgrade2
.
.
.
2020-09-09 123 Upgrade2
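Then I want to partition the rows by activity and see the following in the final result, so that I can count how many users stayed downgraded for more than 30 days and compare their behavior with those who changed to an upgrade.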
2020-08-13 123 Upgrade1. 1
2020-08-14 123 Upgrade1. 2
2020-08-15 123 Upgrade1. 3
2020-08-16 123 Upgrade1. 4
2020-08-17 123 Downgrade1. 1
2020-08-18 123 Downgrade1. 2
2020-08-19 123 Downgrade1. 3
2020-08-20 123 Downgrade1. 4
2020-08-21 123 Upgrade2. 1
.
.
.
2020-09-09 123 Upgrade2. 18
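I tried Coalesce first and then row_num, but I can't work out how to partition each activity based on when the account status changed.
Use posexplode(split(space(datediff(next_date,activity_date)-1),' ')) to generate one row per day between an activity and the next one. Set a new-group flag on every row where the previous activity <> the current activity. Then use the analytic sum() of that flag to number the groups (partitions). See the comments in the code: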
with mydata as (
  select stack(3,
    '2020-08-13', 123, 'Upgrade',
    '2020-08-17', 123, 'Downgrade',
    '2020-08-21', 123, 'Upgrade'
  ) as (activity_date, account, activity)
)
--calculate row_number in account, activity
select activity_date, account, activity, activity_partition,
       row_number() over(partition by account, activity_partition order by activity_date) activity_partition_rn,
       count(*) over(partition by account, activity_partition) days_on_activity
from
( --Calculate partition
  select activity_date, account, activity,
         concat(activity,
                sum(new_group_flag) over(partition by account, activity order by activity_date rows between unbounded preceding and current row)
         ) activity_partition
  from
  ( --Calculate new group flag
    select activity_date, account, activity,
           case when lag(activity) over (partition by account order by activity_date) = activity then 0 else 1 end as new_group_flag
    from
    ( --generate Date range
      select date_add(activity_date, i) as activity_date, account, activity
      from
      ( --Get next_date to generate date range
        select activity_date, account, activity,
               lead(activity_date, 1, activity_date) over (partition by account order by activity_date) next_date
        from mydata d
      ) s lateral view posexplode(split(space(datediff(next_date, activity_date) - 1), ' ')) e as i, x --generate rows
    ) s
  ) s
) s
order by activity_date;
Result:
activity_date account activity activity_partition activity_partition_rn days_on_activity
2020-08-13 123 Upgrade Upgrade1 1 4
2020-08-14 123 Upgrade Upgrade1 2 4
2020-08-15 123 Upgrade Upgrade1 3 4
2020-08-16 123 Upgrade Upgrade1 4 4
2020-08-17 123 Downgrade Downgrade1 1 4
2020-08-18 123 Downgrade Downgrade1 2 4
2020-08-19 123 Downgrade Downgrade1 3 4
2020-08-20 123 Downgrade Downgrade1 4 4
2020-08-21 123 Upgrade Upgrade2 1 1
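The row generator is the least obvious part of the query, so here it is in isolation as a minimal sketch. It hard-codes the first interval from the sample data (2020-08-13 up to, but not including, 2020-08-17); the one-row derived table t and its dummy column are placeholders added only so the lateral view has something to attach to.

```sql
-- datediff('2020-08-17', '2020-08-13') = 4, so space(4 - 1) is a string of 3 spaces.
-- Splitting that string on ' ' gives 4 empty strings, which posexplode
-- turns into 4 rows with positions i = 0..3 (x is the unused empty string).
select date_add('2020-08-13', e.i) as activity_date
from (select 1 as dummy) t  -- hypothetical one-row source
lateral view posexplode(
  split(space(datediff('2020-08-17', '2020-08-13') - 1), ' ')
) e as i, x;
```

This should return the four dates 2020-08-13 through 2020-08-16, matching the Upgrade1 rows in the result above. When the two dates are equal, space() receives -1 and should yield an empty string, so the split still produces one element and the row itself is kept; that is why the last activity appears exactly once.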
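In the result above, Upgrade2 has only one row because lead() falls back to the activity's own date when there is no later activity. The question wanted the last status carried forward to 2020-09-09. One possible way to do that, shown here as a sketch rather than as part of the original answer, is to change the default passed to lead(). The cutoff '2020-09-10' is an assumption: it is one day after the last day wanted, because the generated range excludes next_date itself.

```sql
with mydata as (
  select stack(3,
    '2020-08-13', 123, 'Upgrade',
    '2020-08-17', 123, 'Downgrade',
    '2020-08-21', 123, 'Upgrade'
  ) as (activity_date, account, activity)
)
-- Date-range step only; the new-group flag and partition steps stay unchanged.
select date_add(s.activity_date, e.i) as activity_day, s.account, s.activity
from
( select activity_date, account, activity,
         -- ASSUMPTION: '2020-09-10' is a hypothetical cutoff,
         -- one day after the last day to report (2020-09-09).
         lead(activity_date, 1, '2020-09-10')
           over (partition by account order by activity_date) as next_date
  from mydata
) s
lateral view posexplode(
  split(space(datediff(s.next_date, s.activity_date) - 1), ' ')
) e as i, x
order by activity_day;
```

With the sample data this should produce 28 rows (4 + 4 + 20), ending on 2020-09-09. Feeding this range into the remaining steps of the query would then let days_on_activity reflect the full length of the final status, which is what a filter such as "Downgrade for more than 30 days" would need.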