How to improve the efficiency of this calculation in Hive?
The following code runs on Hive:
select day, count(mdn) * 5 as number
from (
    select distinct a.mdn, a.day
    from flow a
    left outer join flow b
        on a.day = date_add(b.day, -1) and a.mdn = b.mdn
    left outer join flow c
        on a.day = date_add(c.day, -2) and a.mdn = c.mdn
    left outer join flow d
        on a.day = date_add(d.day, -3) and a.mdn = d.mdn
    where b.mdn is null and c.mdn is null and d.mdn is null
) t
group by day
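The logic is to select, for each day, the mdn values that do not appear on any of the following three days, and then count them per day. But the query is too slow because it joins the same large table, flow, to itself three times. How can I simplify it and make it efficient?
Well, you can use lead() to look at the next day on which the same mdn appears and compare the dates: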
select f.*
from (select f.*,
lead(f.day) over (partition by f.mdn order by f.day) as next_day
from flow f
) f
where next_day > date_add(day, 3) or next_day is null;
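To get the per-day counts the question asks for, the lead() filter can be wrapped in the same aggregation as the original query. The following is only a sketch under some assumptions: that flow has the columns mdn and day, that day is a DATE or a 'yyyy-MM-dd' string, and that the table may contain repeated (mdn, day) rows (the original query used distinct, so they are collapsed first here). The * 5 multiplier is copied from the question as-is.

```sql
-- Sketch only: assumes flow(mdn, day) with day as a DATE or 'yyyy-MM-dd' string.
select day,
       count(mdn) * 5 as number
from (
    select mdn,
           day,
           -- next distinct day on which the same mdn appears (null if there is none)
           lead(day) over (partition by mdn order by day) as next_day
    from (
        -- collapse repeated (mdn, day) rows so lead() never returns the same day
        select distinct mdn, day
        from flow
    ) dedup
) t
-- keep a row only if the mdn is absent on all of the following three days
where next_day is null
   or next_day > date_add(day, 3)
group by day;
```

This reads flow once instead of joining it to itself three times. The one sort per mdn partition that lead() needs is usually much cheaper than three self-joins on a large table, though that is worth checking against the actual data.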