Query for finding new values not available in the previous time point for time series data in a Hive table

Question

I would like to query data from a Hive table using PySpark.

The table:

   year month ids
   2005 10    csec
   2005 10    thth
   2005 11    csec
   2005 11    thth
   2005 11    yjsd
   2005 12    yjwe
   2005 12    yjsd

I need:

   year month ids
   2005 11    yjsd -- does not appear in 2005-10
   2005 12    yjwe -- does not appear in 2005-11

The goal is to find the new "ids" values that were not present in the previous month.

My SQL:

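  select a.year, a.month, count(distinct a.ids) as dist_ids
  from MY_TABLE as a
  where a.ids not in
  (
    select distinct b.ids
    from MY_TABLE as b
    where b.ids is not null
        -- previous month: same year, or December of the previous year when a.month = 1
        -- (the two cases are parenthesized together so the null check applies to both)
        and ((a.year = b.year and a.month - 1 = b.month)
          or (a.year - 1 = b.year and a.month = 1 and b.month = 12))
  )
  group by a.year, a.month
  order by a.year, a.month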

But the query is very slow.

How can I speed it up?

Thanks

Answer 1

If you have only one table, you can use window functions. Assuming that you have one row per month per id as in your sample data:

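  select a.year, a.month, count(distinct a.ids) as dist_ids
  from (select a.*,
               -- month index of the previous row for the same ids value, if any
               lag(year * 12 + month) over (partition by ids order by year, month) as prev_yyyymm
        from MY_TABLE a
       ) a
  -- keep rows with no earlier occurrence, or whose last occurrence was not the previous month
  where prev_yyyymm is null or prev_yyyymm <> year * 12 + month - 1
  group by year, month
  order by year, month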
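
To sanity-check the window-function approach on the sample data from the question, here is a minimal sketch in Python. It makes several assumptions that are not in the original answer: SQLite (through the standard-library sqlite3 module, SQLite 3.25 or later for window functions) is used as a stand-in for Hive, the individual ids are selected instead of the distinct count so the matching rows are visible, and the helper name find_new_ids is hypothetical.

```python
import sqlite3

# Sample rows from the question: (year, month, ids).
ROWS = [
    (2005, 10, "csec"), (2005, 10, "thth"),
    (2005, 11, "csec"), (2005, 11, "thth"), (2005, 11, "yjsd"),
    (2005, 12, "yjwe"), (2005, 12, "yjsd"),
]

# Same lag() logic as the answer, but returning the rows instead of a count.
NEW_IDS_SQL = """
select year, month, ids
from (select t.*,
             lag(year * 12 + month) over (partition by ids order by year, month) as prev_yyyymm
      from MY_TABLE t
     ) t
where prev_yyyymm is null or prev_yyyymm <> year * 12 + month - 1
order by year, month, ids
"""


def find_new_ids(sample_rows):
    """Load the rows into an in-memory SQLite table (a stand-in for Hive) and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table MY_TABLE (year integer, month integer, ids text)")
        conn.executemany("insert into MY_TABLE values (?, ?, ?)", sample_rows)
        return conn.execute(NEW_IDS_SQL).fetchall()
    finally:
        conn.close()


new_ids = find_new_ids(ROWS)
for row in new_ids:
    print(row)
```

On the sample data this returns yjsd for 2005-11 and yjwe for 2005-12, plus csec and thth for 2005-10, because the first month in the table has no earlier rows to compare against. If the first month should be excluded, an extra filter on the minimum year and month would be needed.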

solution1
1 ACCEPTED 2020-08-05 00:57:29
