How to simplify a query and improve its efficiency in Hive?

Question

The code runs on Hive:

select day, count(mdn) * 5 as number
from (
    select distinct a.mdn, a.day
    from flow a
    -- b, c, d: the same mdn seen 1, 2, and 3 days after a.day
    left outer join flow b
        on a.day = date_add(b.day, -1) and a.mdn = b.mdn
    left outer join flow c
        on a.day = date_add(c.day, -2) and a.mdn = c.mdn
    left outer join flow d
        on a.day = date_add(d.day, -3) and a.mdn = d.mdn
    -- keep only rows with no match on any of the three following days
    where b.mdn is null and c.mdn is null and d.mdn is null
) t
group by day

The query selects each mdn seen on a given day that does not appear on any of the following three days, then counts those mdn values per day. It is very slow because it joins the large table flow to itself three times. How can it be simplified and made more efficient?

Answer 1

Well, you can use lead() to find the next day on which each mdn appears and compare the dates:

select f.*
from (select f.*,
             -- next day on which the same mdn appears, or null if there is none
             lead(f.day) over (partition by f.mdn order by f.day) as next_day
      from flow f
     ) f
where next_day > date_add(day, 3) or next_day is null;
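
To get the per-day count asked for in the question, the lead() filter can be wrapped in the original aggregation. The following is a sketch under assumptions rather than a tested implementation: it assumes day is a date, or a 'yyyy-MM-dd' string that date_add() accepts and that compares correctly with its result. The aliases d and t are arbitrary, and the inner select distinct only mirrors the question's deduplication of (mdn, day) pairs.

```sql
-- Sketch: count, per day, the mdn values that do not reappear
-- within the following three days, without any self-join.
select day, count(mdn) * 5 as number
from (
    select mdn,
           day,
           -- next distinct day on which this mdn appears (null if none)
           lead(day) over (partition by mdn order by day) as next_day
    from (
        -- assumption: flow may hold several rows per (mdn, day)
        select distinct mdn, day
        from flow
    ) d
) t
where next_day is null
   or next_day > date_add(day, 3)
group by day;
```

Written this way, the query should read flow once and replace the three self-joins with a single window function partitioned by mdn.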

Solution 1
1 Accepted 2018-03-01 02:53:32
