30-day rolling/moving sum when current date is missing
I have a table (view_of_referred_events) which stores the number of visitors for a given page.
date country_id referral product_id visitors
2016-04-01 216 pl 113759 1
2016-04-03 216 pl 113759 1
2016-04-06 216 pl 113759 13
2016-04-07 216 pl 113759 10
I want to compute the 30-day rolling/moving sum for this product, including the days that are missing from the table. The end result should look something like the following:
date country_id referral product_id cumulative_visitors
2016-04-01 216 pl 113759 1
2016-04-02 216 pl 113759 1
2016-04-03 216 pl 113759 2
2016-04-04 216 pl 113759 2
2016-04-05 216 pl 113759 2
2016-04-06 216 pl 113759 15
2016-04-07 216 pl 113759 25
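To make the target concrete, here is a minimal Python sketch (not Impala SQL) of the computation described above, assuming a single {country_id, referral, product_id} combination. The `events` dict and the `rolling_sum` helper are hypothetical names used only for illustration; the values come from the sample table.

```python
from datetime import date, timedelta

# Sample rows from the question for one (country_id, referral, product_id) key.
events = {
    date(2016, 4, 1): 1,
    date(2016, 4, 3): 1,
    date(2016, 4, 6): 13,
    date(2016, 4, 7): 10,
}

def rolling_sum(events, start, end, window_days=30):
    """Return (day, visitors summed over the window_days ending on day) for every day."""
    result = []
    day = start
    while day <= end:
        window_start = day - timedelta(days=window_days - 1)
        # Days with no row simply contribute nothing to the sum.
        total = sum(v for d, v in events.items() if window_start <= d <= day)
        result.append((day, total))
        day += timedelta(days=1)
    return result

rows = rolling_sum(events, date(2016, 4, 1), date(2016, 4, 7))
for day, total in rows:
    print(day.isoformat(), total)
```

Because it loops over every calendar day between `start` and `end`, this produces a row for the missing dates too, which is the behaviour the SQL needs to reproduce.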
Now, this is a simplified representation, because I have tens of different country_id, referral and product_id values. I can't pre-create a table with all possible combinations of {date, country_id, referral, product_id}, because that would become unmanageable given the size of the table. I also don't want a row in the final table if that specific {date, country_id, referral, product_id} combination didn't exist before.
I was wondering whether there is an easy way to tell Impala to use the value of the previous row (the previous day) if view_of_referred_events has no visitors for that day.
I wrote this query, where list_of_dates is a table with a list of days from April 1st to April 7th.
select
t.`date`,
t.country_id,
t.referral,
t.product_id,
sum(visitors) over (partition by t.country_id, t.referral, t.product_id order by t.`date`
rows between 30 preceding and current row) as cumulative_sum_visitors
from (
select
d.`date`,
re.country_id,
re.referral,
re.product_id,
sum(visitors) as visitors
from list_of_dates d
left outer join view_of_referred_events re on d.`date` = re.`date`
and re.referral = "pl"
and re.product_id = "113759"
and re.country_id = "216"
group by d.`date`, re.country_id, re.referral, re.product_id
) t
order by t.`date` asc;
This returns something similar to what I want, but not exactly that.
date country_id referral product_id cumulative_visitors
2016-04-01 216 pl 113759 1
2016-04-02 NULL NULL NULL NULL
2016-04-03 216 pl 113759 2
2016-04-04 NULL NULL NULL NULL
2016-04-05 NULL NULL NULL NULL
2016-04-06 216 pl 113759 15
2016-04-07 216 pl 113759 25
I'm not sure how good the performance will be, but you can do this by aggregating the data twice: the second aggregation adds 30 days to each date and negates the count, so that each day's visitors drop out of the running total 30 days later.
Something like this:
with t as (
select d.`date`, re.country_id, re.referral, re.product_id,
sum(visitors) as visitors
from list_of_dates d left outer join
view_of_referred_events re
on d.`date` = re.`date` and
re.referral = 'pl' and
re.product_id = 113759 and
re.country_id = 216
group by d.`date`, re.country_id, re.referral, re.product_id
)
select `date`, country_id, referral, product_id,
sum(sum(visitors)) over (partition by country_id, referral, product_id order by `date`) as visitors
from ((select `date`, country_id, referral, product_id, visitors
from t
) union all
(select date_add(`date`, 30), country_id, referral, product_id, -visitors
from t
)
) tt
group by `date`, country_id, referral, product_id;
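The idea behind the union all can be sketched in Python (an illustration of the logic only, not Impala code). Each count is added on its own date and subtracted again 30 days later, so a plain running total behaves like a 30-day window. The names `daily`, `deltas` and `running` are hypothetical, and the data is the sample from the question.

```python
from collections import defaultdict
from datetime import date, timedelta
from itertools import accumulate

# Daily visitor counts for one (country_id, referral, product_id) key.
daily = {
    date(2016, 4, 1): 1,
    date(2016, 4, 3): 1,
    date(2016, 4, 6): 13,
    date(2016, 4, 7): 10,
}

WINDOW = 30  # days a count stays inside the rolling sum

# "union all" step: +visitors on the event date, -visitors WINDOW days later.
deltas = defaultdict(int)
for day, visitors in daily.items():
    deltas[day] += visitors
    deltas[day + timedelta(days=WINDOW)] -= visitors

# "sum(...) over (order by date)" step: running total of the deltas.
days = sorted(deltas)
running = dict(zip(days, accumulate(deltas[d] for d in days)))

for day in days:
    print(day.isoformat(), running[day])
```

Note that this sketch only has entries on dates where the total changes; on the days in between, the previous value still applies. In the query above, the join against list_of_dates is what supplies the remaining dates.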
I have added another subquery to get the value from the last row in the partition. I am not sure what version of Hive/Impala you are using; the syntax is last_value(column_name, ignore null values true/false).
I assume you are trying to find the cumulative counts over 30 days (a month), so I recommend using a month field to group the rows. The month could come either from your dimension table list_of_dates or just from substr(date, 1, 7); you can then get the cumulative counts of visitors over rows between unbounded preceding and current row.
Query:
select
`date`,
country_id,
referral,
product_id,
sum(visitors) over (partition by country_id, referral, product_id order by `date`
rows between 30 preceding and current row) as cumulative_sum_visitors
from (select
t.`date`,
-- get the last not null value from the partition window w for country_id, referral & product_id
last_value(t.country_id, true) over w as country_id,
last_value(t.referral, true) over w as referral,
last_value(t.product_id, true) over w as product_id,
-- "visitors = null" is never true in SQL; test with "is null" instead
if(visitors is null, 0, visitors) as visitors
from (
select
d.`date`,
re.country_id,
re.referral,
re.product_id,
sum(visitors) as visitors
from list_of_dates d
left outer join view_of_referred_events re on d.`date` = re.`date`
and re.referral = "pl"
and re.product_id = "113759"
and re.country_id = "216"
group by d.`date`, re.country_id, re.referral, re.product_id
) t
-- no "partition by" here: the key columns are NULL on the filled-in dates, so partitioning on them would put those rows in their own partition
window w as (order by t.`date`
rows between unbounded preceding and unbounded following)) t1
order by `date` asc;
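What the last_value(..., true) step is meant to do can be sketched in Python (an illustration only, not Hive/Impala code). The `joined` rows mimic the output of the left join for the sample data, and `fill_forward` is a hypothetical helper that carries the last non-null key forward and turns missing visitor counts into 0.

```python
from datetime import date

# Rows as the left join returns them:
# (date, country_id, referral, product_id, visitors); None stands for NULL.
joined = [
    (date(2016, 4, 1), 216, "pl", 113759, 1),
    (date(2016, 4, 2), None, None, None, None),
    (date(2016, 4, 3), 216, "pl", 113759, 1),
    (date(2016, 4, 4), None, None, None, None),
    (date(2016, 4, 5), None, None, None, None),
    (date(2016, 4, 6), 216, "pl", 113759, 13),
    (date(2016, 4, 7), 216, "pl", 113759, 10),
]

def fill_forward(rows):
    """Replace None keys with the last non-None key seen, and None visitors with 0."""
    filled = []
    last_key = (None, None, None)
    for day, country_id, referral, product_id, visitors in rows:
        if country_id is not None:
            last_key = (country_id, referral, product_id)
        filled.append((day, *last_key, visitors or 0))
    return filled

filled = fill_forward(joined)

ROWS = 30  # one row per day, so 30 rows cover a 30-day window
cumulative = []
for i, (day, country_id, referral, product_id, _) in enumerate(filled):
    window = filled[max(0, i - ROWS + 1): i + 1]
    total = sum(row[4] for row in window)
    cumulative.append((day, country_id, referral, product_id, total))

for row in cumulative:
    print(*row)
```

This sketch assumes a single key combination, as in the filtered query; with several combinations in the same result, the fill would have to be done per combination.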