简体   繁体   中英

Most recurring value in a specific time window

I have the following structure:

table user
user_id | month_year | fruits  
------------------------------
1       | 2021-01    | apple
1       | 2021-01    | melon
1       | 2021-01    | orange
1       | 2021-02    | grape
1       | 2021-02    | orange
1       | 2021-02    | kiwi
1       | 2021-03    | grape
1       | 2021-03    | pear
1       | 2021-03    | banana
1       | 2021-04    | orange
1       | 2021-04    | kiwi
1       | 2021-04    | banana
1       | 2021-05    | grape
1       | 2021-05    | pear
1       | 2021-05    | kiwi

And I want the following result:

user     | month_year |            fruits            |  two_months_most_freq
-------------------------------------------------------------------------
1        | 2021-01    | apple, melon, orange         | orange
1        | 2021-02    | grape, orange, kiwi          | orange
1        | 2021-03    | grape, pear, banana          | grape
1        | 2021-04    | orange, kiwi, banana         | banana
1        | 2021-05    | grape, pear, kiwi            | kiwi

Clearing: In the last column I want the most recurrent fruit in the last 2 months period, in other words, the most repeated in the actual and previous line. Notice that in the first line should return orange because the window frame ahead should be used when the window frame behind is not available.

In the code below, I achieved the most recurrent fruit in the whole dataset.

select * from (
  select user_id, year_month, 
    string_agg(distinct fruit) as fruits
  from user
  group by  user_id, year_month
) join (
  select user_id, fruit
  from user
  group by user_id, fruit
  qualify 1 = row_number() over(partition by user_id order by count(*) desc)
)
using (user_id)   

How can I apply this logic for specific time windows?

Consider below

select user_id, month_year, fruits, 
  if(prev_month_exists, two_months_most_freq, first_value(two_months_most_freq) over next_month) as two_months_most_freq 
from (
  select user_id, month_year, fruits, 
    ( select fruit from unnest(split(two_month_fruits)) fruit
      group by fruit order by count(*) desc limit 1
    ) as two_months_most_freq, 
    month, prev_month_exists
  from (
    select distinct user_id, month_year, month, 
      string_agg(fruit) over(partition by user_id, month_year) fruits,
      string_agg(fruit) over last_two_months as two_month_fruits,
      0 < count(*) over prev_month as prev_month_exists
    from users, unnest([struct(
      12 * extract(year from date(month_year || '-01')) + extract(month from date(month_year || '-01')) as month
    )]) 
    window 
      last_two_months as (partition by user_id order by month range between 1 preceding and current row), 
      prev_month as (partition by user_id order by month range between 1 preceding and 1 preceding)
  )
)
window next_month as (partition by user_id order by month range between 1 following and 1 following)                 

if applied to sample data in your question - output is

在此处输入图像描述

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM