Rolling mean with differing number of observations
I'm trying to construct a rolling mean for a dataset over the past 6 months. The data is daily and has more than 100,000 rows; a sample is shown below.
# A tibble: 100 × 5
ID MONTH DATE VALUE R_MEAN
<fctr> <dbl> <date> <dbl> <dbl>
1 634 20160200 2016-02-03 2 0.000000
2 1700 20150300 2015-03-02 3 0.000000
3 1700 20150400 2015-04-01 7 3.000000
4 1700 20150400 2015-04-09 1 5.000000
5 1700 20150700 2015-07-02 26 3.666667
6 1700 20150800 2015-08-03 1 9.250000
7 1700 20150900 2015-09-01 2 7.600000
8 1700 20151000 2015-10-01 5 7.400000
9 1700 20151000 2015-10-07 10 7.833333
10 1700 20151100 2015-11-02 8 8.800000
# ... with 90 more rows
My goal is to create a moving average over the past 6 months. For example, for ID X and a DATE value of 20160101, I want the average VALUE of all rows that have the same ID and a DATE between 20150601 and 20160101. When no previous values are available, I assume an average of zero.
I thought of using some sort of expanding-grid approach, but as I have a lot of IDs (close to 30,000), expanding the grid on a daily basis over a period of 2 years would result in an enormous grid.
Here I use dplyr. I inner_join the table on itself, then, for each row in the source data, filter the relevant previous rows and calculate their mean value. Finally I left_join the original data on the processed data and replace NA with 0 using coalesce.
The 6-month window is calculated by subtracting 182 days from DATE. You could also use lubridate to make it a period in months. Personally I prefer to work with a fixed window of days, which does not depend on the differing number of days in each month.
str <- '
row ID MONTH DATE VALUE R_MEAN
1 634 20160200 2016-02-03 2 0.000000
2 1700 20150300 2015-03-02 3 0.000000
3 1700 20150400 2015-04-01 7 3.000000
4 1700 20150400 2015-04-09 1 5.000000
5 1700 20150700 2015-07-02 26 3.666667
6 1700 20150800 2015-08-03 1 9.250000
7 1700 20150900 2015-09-01 2 7.600000
8 1700 20151000 2015-10-01 5 7.400000
9 1700 20151000 2015-10-07 10 7.833333
10 1700 20151100 2015-11-02 8 8.800000
'
file <- textConnection(str)
raw <- read.table(file, header = TRUE)
library(dplyr)
df <- raw %>% mutate(DATE = as.Date(DATE, '%Y-%m-%d'))
# Self-join on ID, then keep only the rows dated within the 182 days before each row
prev <- df %>% inner_join(df, by = 'ID') %>%
  filter(DATE.y > DATE.x - 182, DATE.y < DATE.x) %>%
  group_by(row.x) %>% summarise(meanVALUE = mean(VALUE.y)) %>%
  rename(row = row.x)
# Rows without earlier observations get NA from the join; replace it with 0
df %>% left_join(prev, by = 'row') %>% mutate(meanVALUE = coalesce(meanVALUE, 0))
Result:
row ID MONTH DATE VALUE R_MEAN meanVALUE
1 1 634 20160200 2016-02-03 2 0.000000 0.000000
2 2 1700 20150300 2015-03-02 3 0.000000 0.000000
3 3 1700 20150400 2015-04-01 7 3.000000 3.000000
4 4 1700 20150400 2015-04-09 1 5.000000 5.000000
5 5 1700 20150700 2015-07-02 26 3.666667 3.666667
6 6 1700 20150800 2015-08-03 1 9.250000 9.250000
7 7 1700 20150900 2015-09-01 2 7.600000 8.750000
8 8 1700 20151000 2015-10-01 5 7.400000 7.500000
9 9 1700 20151000 2015-10-07 10 7.833333 7.000000
10 10 1700 20151100 2015-11-02 8 8.800000 8.800000
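The lubridate variant mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the three-row data frame is made up (it is not the data from the question), and prev_months and res are names chosen here. The only change from the 182-day version is the filter, which uses lubridate's %m-% operator to step back six calendar months.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample, chosen so the two window definitions disagree on row 3
df <- data.frame(
  row   = 1:3,
  ID    = c(1700, 1700, 1700),
  DATE  = as.Date(c("2015-03-02", "2015-04-02", "2015-10-01")),
  VALUE = c(3, 7, 5)
)

# Same self-join as before, but the window starts six calendar months back
prev_months <- df %>%
  inner_join(df, by = "ID") %>%
  filter(DATE.y > DATE.x %m-% months(6), DATE.y < DATE.x) %>%
  group_by(row.x) %>%
  summarise(meanVALUE = mean(VALUE.y)) %>%
  rename(row = row.x)

# Rows with no earlier observations in the window get 0
res <- df %>%
  left_join(prev_months, by = "row") %>%
  mutate(meanVALUE = coalesce(meanVALUE, 0))
```

For row 3 (2015-10-01) the calendar window starts after 2015-04-01, so the 2015-04-02 observation is included and meanVALUE is 7. With the fixed window, 2015-10-01 minus 182 days is 2015-04-02, which the strict comparison excludes, so the same row would get 0. Recent dplyr versions may warn about a many-to-many join here; that is expected for a self-join on ID.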
Maybe this helps:
# Mean VALUE per ID over the 182 days before today
ids <- unique(df$ID)
for (i in seq_along(ids))
  print(mean(df$VALUE[df$DATE > (Sys.Date() - 182) &
                      df$ID == ids[i]],
             na.rm = TRUE))
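The loop above measures the window back from today's date. If the window should instead end at each row's own DATE, as the question describes, a base R sketch could look like the following. The helper name rolling_prev_mean and the small data frame are assumptions made for illustration; the four rows reuse values from the sample in the question.

```r
# Hypothetical helper: for each row, the mean VALUE of earlier rows with the
# same ID dated within `days` days before that row's DATE; 0 if there are none.
rolling_prev_mean <- function(df, days = 182) {
  sapply(seq_len(nrow(df)), function(i) {
    in_window <- df$ID == df$ID[i] &
      df$DATE < df$DATE[i] &
      df$DATE > df$DATE[i] - days
    if (any(in_window)) mean(df$VALUE[in_window], na.rm = TRUE) else 0
  })
}

df <- data.frame(
  ID    = c(634, 1700, 1700, 1700),
  DATE  = as.Date(c("2016-02-03", "2015-03-02", "2015-04-01", "2015-04-09")),
  VALUE = c(2, 3, 7, 1)
)
df$meanVALUE <- rolling_prev_mean(df)
```

Each row is compared against the whole table, so on 100,000 rows this is likely to be slow; splitting the data by ID first would probably be needed at that size.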