
How to improve the efficiency of a calculation in Hive?

The following query runs on Hive:

select day,count(mdn)*5 as number from
(select distinct a.mdn,a.day from 
flow a
left outer join
flow b
on a.day=date_add(b.day,-1) and a.mdn=b.mdn
left outer join
flow c
on a.day=date_add(c.day,-2) and a.mdn=c.mdn
left outer join
flow d
on a.day=date_add(d.day,-3) and a.mdn=d.mdn
where b.mdn is null  and c.mdn is null  and d.mdn is null)t 
group by day

The query selects each mdn that appears on a given day but does not appear on any of the following three days, then counts those mdn values per day. It is very slow because it joins the large table flow to itself three times. How can it be simplified so that it runs efficiently?

Well, you can look at the next day on which each mdn appears using lead() and compare the dates:

select f.*
from (select f.*,
             lead(f.day) over (partition by f.mdn order by f.day) as next_day
      from flow f
     ) f
where next_day > date_add(day, 3) or next_day is null;
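
To get the same per-day figure as the query in the question, the lead() filter can be wrapped in an aggregation. The following is only a sketch, not a tested drop-in replacement. It assumes that flow has the columns mdn and day, and that day is a date or a 'yyyy-MM-dd' string. It keeps the distinct and the *5 multiplier from the question, and it uses datediff() in place of the date_add() comparison so the check does not depend on whether day is a string or a date:

```sql
-- Sketch: per-day count of mdn values that do not reappear
-- on any of the following three days, using one scan of flow.
-- Assumes flow(mdn, day), with day a date or a 'yyyy-MM-dd' string.
select day, count(mdn) * 5 as number
from (select mdn, day,
             -- next day on which the same mdn appears, or null if none
             lead(day) over (partition by mdn order by day) as next_day
      from (select distinct mdn, day from flow) d  -- one row per mdn per day
     ) t
where next_day is null                -- mdn never appears again
   or datediff(next_day, day) > 3     -- next appearance is more than 3 days later
group by day;
```

Because the window function only needs the rows of one mdn at a time, this replaces the three self-joins with a single pass over flow plus a sort per mdn.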
