Query to find new values not present in the previous time point of time series data in a Hive table

I would like to query data from a Hive table using PySpark.

The table:

   year month ids
   2005 10    csec
   2005 10    thth
   2005 11    csec
   2005 11    thth
   2005 11    yjsd
   2005 12    yjwe
   2005 12    yjsd

I need:

   year month ids
   2005 11    yjsd -- does not appear in 2005-10
   2005 12    yjwe -- does not appear in 2005-11

The goal is to find the new "ids" values that were not present in the previous month.

My SQL:

  select a.year, a.month, count(distinct(a.ids)) as dist_ids
  from MY_TABLE as a
  where a.ids not in 
  (
    select distinct b.ids
    from MY_TABLE as b
    where isnull(b.ids) = false 
        and ((a.year = b.year and a.month - 1 = b.month)
          or (a.year - 1 = b.year and a.month = 1 and b.month = 12))
   ) 
   group by year, month
   order by year, month

However, the query is very slow.

How can I speed it up?

Thanks.

If you have only one table, you can use window functions. Assuming that you have one row per month per id as in your sample data:

select a.year, a.month, count(distinct(a.ids)) as dist_ids
from (select a.*,
             lag(year * 12 + month) over (partition by ids order by year, month) as prev_yyyymm
      from MY_TABLE a
     ) a
where prev_yyyymm is null or prev_yyyymm <> year * 12 + month - 1
group by year, month
order by year, month
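
To check the window-function approach against the sample data, here is a minimal sketch that runs the same `lag` logic in an in-memory SQLite database. SQLite is only a stand-in for Hive/Spark SQL here (an assumption made so the example is self-contained; it needs SQLite 3.25 or later for window functions). The helper name `new_ids` and the column alias `prev_month_index` are hypothetical, and the query lists the ids rather than counting them so the result is easy to compare with the expected output.

```python
import sqlite3

# Sample rows from the question: (year, month, ids).
ROWS = [
    (2005, 10, "csec"), (2005, 10, "thth"),
    (2005, 11, "csec"), (2005, 11, "thth"), (2005, 11, "yjsd"),
    (2005, 12, "yjwe"), (2005, 12, "yjsd"),
]

# year * 12 + month is a running month index, so "previous month"
# is always index - 1, including across a year boundary.
QUERY = """
SELECT year, month, ids
FROM (
    SELECT t.*,
           LAG(year * 12 + month) OVER (
               PARTITION BY ids ORDER BY year, month
           ) AS prev_month_index
    FROM MY_TABLE AS t
) AS s
WHERE prev_month_index IS NULL
   OR prev_month_index <> year * 12 + month - 1
ORDER BY year, month, ids
"""


def new_ids(rows):
    """Return (year, month, ids) rows whose ids was absent in the previous month."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MY_TABLE (year INTEGER, month INTEGER, ids TEXT)")
    conn.executemany("INSERT INTO MY_TABLE VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in new_ids(ROWS):
        print(row)
```

Note that ids in the earliest month (2005 10) are also reported, because they have no previous row at all. If the first month should be excluded, as in the expected output in the question, an extra filter on the minimum month index would be needed.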
