I would like to query data from a Hive table using PySpark.
The table:
year month ids
2005 10 csec
2005 10 thth
2005 11 csec
2005 11 thth
2005 11 yjsd
2005 12 yjwe
2005 12 yjsd
I need:
year month ids
2005 11 yjsd -- does not appear in 200510
2005 12 yjwe -- does not appear in 200511
The goal is to find the new "ids" that did not appear in the previous month.
My SQL:
select a.year, a.month, count(distinct(a.ids)) as dist_ids
from MY_TABLE as a
where a.ids not in
(
    select distinct b.ids
    from MY_TABLE as b
    where isnull(b.ids) = false
      -- parentheses added: "and" binds tighter than "or", so the null check must wrap both cases
      and ((a.year = b.year and a.month - 1 = b.month)
           or (a.year - 1 = b.year and a.month = 1 and b.month = 12))
)
group by year, month
order by year, month
However, the query is very slow.
How can I speed it up?
Thanks.
Since everything comes from a single table, you can use window functions instead of the correlated subquery. Assuming that you have one row per month per id, as in your sample data:
select a.year, a.month, count(distinct(a.ids)) as dist_ids
from (select a.*,
             -- month index (year * 12 + month) of the previous row for the same id
             lag(year * 12 + month) over (partition by ids order by year, month) as prev_yyyymm
      from MY_TABLE a
     ) a
where prev_yyyymm is null or prev_yyyymm <> year * 12 + month - 1
group by year, month
order by year, month
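To sanity-check the window-function logic on the sample data from the question, here is a minimal sketch that runs the same lag() idea in Python's built-in sqlite3 module. SQLite is only a stand-in for Hive/Spark SQL here (window functions need SQLite 3.25 or later), and the column name prev_month_index is my own choice for illustration. It lists the individual new ids rather than counting them, so the result can be compared with the expected output in the question.

```python
import sqlite3

# Sample rows from the question: (year, month, ids)
rows = [
    (2005, 10, "csec"), (2005, 10, "thth"),
    (2005, 11, "csec"), (2005, 11, "thth"), (2005, 11, "yjsd"),
    (2005, 12, "yjwe"), (2005, 12, "yjsd"),
]

# In-memory SQLite database used as a stand-in for the Hive table
conn = sqlite3.connect(":memory:")
conn.execute("create table MY_TABLE (year integer, month integer, ids text)")
conn.executemany("insert into MY_TABLE values (?, ?, ?)", rows)

# year * 12 + month turns (year, month) into a single running month index,
# so "previous month" is always index - 1, including across a year boundary.
query = """
select year, month, ids
from (select t.*,
             lag(year * 12 + month) over (partition by ids order by year, month) as prev_month_index
      from MY_TABLE t
     ) t
where prev_month_index is null or prev_month_index <> year * 12 + month - 1
order by year, month, ids
"""

new_ids = conn.execute(query).fetchall()
for row in new_ids:
    print(row)
# (2005, 10, 'csec')
# (2005, 10, 'thth')
# (2005, 11, 'yjsd')
# (2005, 12, 'yjwe')
```

Note that the ids in the first month (200510) are also reported as new, because they have no earlier row at all. If you do not want those, filter out the earliest month in the outer query.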