30-day rolling/moving sum when current date is missing

I have a table (view_of_referred_events) which stores the number of visitors for a given page.

date        country_id  referral    product_id  visitors
2016-04-01  216         pl          113759      1
2016-04-03  216         pl          113759      1
2016-04-06  216         pl          113759      13
2016-04-07  216         pl          113759      10

I want to compute the 30-day rolling/moving sum for this product, including the days that are missing from the table. So the end result should be something like the following:

date        country_id  referral    product_id  cumulative_visitors
2016-04-01  216         pl          113759      1
2016-04-02  216         pl          113759      1
2016-04-03  216         pl          113759      2
2016-04-04  216         pl          113759      2
2016-04-05  216         pl          113759      2
2016-04-06  216         pl          113759      15
2016-04-07  216         pl          113759      25
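
To make the target concrete, here is a minimal sketch in plain Python (not Impala) of the intended result for a single { country_id, referral, product_id } combination. The function name rolling_30d and the in-memory dict are assumptions made for illustration only. It also assumes that "30-day" means the current day plus the 29 days before it, and that no row is produced before the combination first appears.

```python
from datetime import date, timedelta


def rolling_30d(visits, end, window_days=30):
    """Return (day, rolling_sum) for every calendar day up to `end`.

    visits: {date: visitors} for one combination (hypothetical input shape).
    Days before the first observed date produce no row.
    """
    if not visits:
        return []
    out = []
    day = min(visits)  # start at the first day the combination exists
    while day <= end:
        start = day - timedelta(days=window_days - 1)
        total = sum(v for d, v in visits.items() if start <= d <= day)
        out.append((day, total))
        day += timedelta(days=1)
    return out


visits = {
    date(2016, 4, 1): 1,
    date(2016, 4, 3): 1,
    date(2016, 4, 6): 13,
    date(2016, 4, 7): 10,
}
for day, total in rolling_30d(visits, date(2016, 4, 7)):
    print(day.isoformat(), total)
```

With the sample data this reproduces the cumulative_visitors column shown above (1, 1, 2, 2, 2, 15, 25).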

Now, this is a simplistic representation, because I have tens of different country_id, referral and product_id values. I can't pre-create a table with all possible combinations of { date, country_id, referral, product_id }, because it would become unmanageable given the size of the table. I also don't want a row in the final table if that specific { date, country_id, referral, product_id } combination didn't exist before.

I was wondering whether there is an easy way to tell Impala to use the value of the previous row (the previous day) if view_of_referred_events has no visitors for that day.

I wrote this query, where list_of_dates is a table with a list of days from April 1st to April 7th.

select
  t.`date`,
  t.country_id,
  t.referral,
  t.product_id,
  sum(visitors) over (partition by t.country_id, t.referral, t.product_id order by t.`date`
                     rows between 30 preceding and current row) as cumulative_sum_visitors
from (
  select
    d.`date`, 
    re.country_id, 
    re.referral, 
    re.product_id,
    sum(visitors) as visitors
  from list_of_dates d
  left outer join view_of_referred_events re on d.`date` = re.`date`
    and re.referral = "pl"
    and re.product_id = "113759"
    and re.country_id = "216"
  group by d.`date`, re.country_id, re.referral, re.product_id
  ) t
order by t.`date` asc;

This returns something similar to what I want, but not exactly.

date        country_id  referral    product_id  cumulative_visitors
2016-04-01  216         pl          113759      1
2016-04-02  NULL        NULL        NULL        NULL
2016-04-03  216         pl          113759      2
2016-04-04  NULL        NULL        NULL        NULL
2016-04-05  NULL        NULL        NULL        NULL
2016-04-06  216         pl          113759      15
2016-04-07  216         pl          113759      25

I'm not sure how good the performance will be, but you can do this by aggregating the data twice: the second aggregation adds 30 days to each date and negates the count.

Something like this:

with t as (
      select d.`date`, re.country_id, re.referral, re.product_id,
             sum(visitors) as visitors
      from list_of_dates d left outer join
           view_of_referred_events re
           on d.`date` = re.`date` and
              re.referral = 'pl' and
              re.product_id = 113759 and
              re.country_id = 216
      group by d.`date`, re.country_id, re.referral, re.product_id
     )
select `date`, country_id, referral, product_id,
       sum(sum(visitors)) over (partition by country_id, referral, product_id order by `date`) as visitors
from ((select `date`, country_id, referral, product_id, visitors
       from t
      ) union all
      -- each count is subtracted again 30 days later, when it leaves the window
      (select date_add(`date`, 30) as `date`, country_id, referral, product_id, -visitors as visitors
       from t
      )
     ) tt
group by `date`, country_id, referral, product_id;
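
The add-then-subtract idea behind this query can be sketched outside SQL. The following plain-Python illustration is an assumption-laden sketch, not the query itself: the function name rolling_sum_via_deltas and the dict input are hypothetical. Each count is added on its own date and removed again 30 days later, so a running total over the sorted dates equals the 30-day rolling sum.

```python
from collections import defaultdict
from datetime import date, timedelta
from itertools import accumulate


def rolling_sum_via_deltas(visits, window_days=30):
    """visits: {date: visitors} for one combination (hypothetical shape).

    Returns {date: rolling_sum}, with entries only on dates where the
    sum changes.
    """
    deltas = defaultdict(int)
    for d, v in visits.items():
        deltas[d] += v                                # count enters the window
        deltas[d + timedelta(days=window_days)] -= v  # count leaves the window
    days = sorted(deltas)
    totals = accumulate(deltas[d] for d in days)      # running total of deltas
    return dict(zip(days, totals))


visits = {
    date(2016, 4, 1): 1,
    date(2016, 4, 3): 1,
    date(2016, 4, 6): 13,
    date(2016, 4, 7): 10,
}
result = rolling_sum_via_deltas(visits)
print(result[date(2016, 4, 7)])
print(result[date(2016, 5, 1)])
```

Note that this sketch only yields entries on dates where something changes; the days in between still have to come from a date list such as list_of_dates, as in the query above.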

I have added another subquery to get the value from the last row in the partition. I am not sure which version of Hive/Impala you are using; the syntax is last_value(column_name, ignore null values true/false).

I assume you are trying to find the cumulative counts over 30 days (a month). I recommend using a month field to group the rows. The month could come either from your dimension table list_of_dates or simply from substr(date, 1, 7), and you can then get the cumulative counts of visitors over a window of rows between unbounded preceding and current row.

Query:

select
  `date`,
  country_id,
  referral,
  product_id,
  sum(visitors) over (partition by country_id, referral, product_id order by `date`
                     rows between 30 preceding and current row) as cumulative_sum_visitors 
from (select
  t.`date`,
  -- get the last non-null value from the window w for country_id, referral & product_id
  last_value(t.country_id, true) over w as country_id,
  last_value(t.referral, true) over w as referral,
  last_value(t.product_id, true) over w as product_id,
  -- "visitors = null" is never true; test with "is null" so that missing days count as 0
  if(visitors is null, 0, visitors) as visitors
from (
  select
    d.`date`, 
    re.country_id, 
    re.referral, 
    re.product_id,
    sum(visitors) as visitors
  from list_of_dates d
  left outer join view_of_referred_events re on d.`date` = re.`date`
    and re.referral = "pl"
    and re.product_id = "113759"
    and re.country_id = "216"
  group by d.`date`, re.country_id, re.referral, re.product_id
  ) t
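-- no "partition by" here: the gap rows have NULL identifiers and would form their own partition;
-- the inner query is already filtered to a single combination
window w as (order by t.`date`
                     rows between unbounded preceding and unbounded following)) t1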
order by `date` asc;
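
The null-handling step in this query can be illustrated in plain Python. This is a sketch of a carry-forward variant, under stated assumptions: the function name carry_forward and the (day, key, visitors) tuples are hypothetical, and, unlike the window frame in the query, it only looks backwards, so days before the first event keep a missing key.

```python
def carry_forward(rows):
    """Fill gaps left by the outer join.

    rows: list of (day, key, visitors) sorted by day, where key stands for
    the (country_id, referral, product_id) tuple; key and visitors are None
    on days without events (hypothetical input shape).
    """
    filled = []
    last_key = None
    for day, key, visitors in rows:
        if key is not None:
            last_key = key  # remember the most recent non-null identifiers
        # missing visitor counts contribute 0 to the rolling sum
        filled.append((day, last_key, 0 if visitors is None else visitors))
    return filled


rows = [
    ("2016-04-01", (216, "pl", 113759), 1),
    ("2016-04-02", None, None),
    ("2016-04-03", (216, "pl", 113759), 1),
]
for row in carry_forward(rows):
    print(row)
```

Looking only backwards fits the requirement in the question that no row should be attributed to a combination before it first exists.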
