R - Fast way to calculate rolling mean with varying width
I have a data frame that contains bank assets for several dates (times). Each bank has a unique ID:
# Sample Data
time <- c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53)
id <- c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456)
name <- c("BANK A", "BANK A", "BANK A", "BANK A", "BANK A", "BANK B", "BANK B", "BANK C",
"BANK C", "BANK C")
assets <- c(5000, 6000, 4000, 7000, 8000, 10000, 12000, 30000, 35000, 40000)
df <- data.frame(time, id, name, assets)
> df
time id name assets
1 51 1234 BANK A 5000
2 52 1234 BANK A 6000
3 53 1234 BANK A 4000
4 55 1234 BANK A 7000
5 56 1234 BANK A 8000
6 51 2345 BANK B 10000
7 52 2345 BANK B 12000
8 51 3456 BANK C 30000
9 52 3456 BANK C 35000
10 53 3456 BANK C 40000
For each bank I want to calculate the rolling mean of the assets, varying the width according to the number of consecutive time values. So the rolling mean shall include all available consecutive previous values of a bank's assets. If there is no previous value available for a bank, it shall equal assets. Therefore I add a column that counts the number of consecutive time values and then use rollapplyr from the zoo package, which gives me the desired result, but with a large data set it is far too slow:
# Calculate number of consecutive times
require(dplyr)
df <- df %>%
  mutate(number.time = 1) %>% # insert column for number.time, start value = 1
  group_by(id) %>%
  arrange(id, time) # correct order for moving average (arrange() ignores groups by default)
for (i in 2:nrow(df)) # Start loop in second row, end in last row of df
  df$number.time[i] <-
    ifelse(df$id[i] == df$id[i - 1] & df$time[i] == df$time[i - 1] + 1, # Same bank and consecutive time?
           df$number.time[i - 1] + 1, # If yes: add 1 to previous number.time
           1) # If no: set number.time = 1
# Moving Average
require(zoo)
df %>%
  mutate(mov.average = rollapplyr(data = assets,
                                  width = number.time, # use number.time for width
                                  FUN = mean,
                                  fill = NA,
                                  na.rm = TRUE))
Source: local data frame [10 x 6]
Groups: id [3]
time id name assets number.time mov.average
(dbl) (dbl) (fctr) (dbl) (dbl) (dbl)
1 51 1234 BANK A 5000 1 5000
2 52 1234 BANK A 6000 2 5500
3 53 1234 BANK A 4000 3 5000
4 55 1234 BANK A 7000 1 7000
5 56 1234 BANK A 8000 2 7500
6 51 2345 BANK B 10000 1 10000
7 52 2345 BANK B 12000 2 11000
8 51 3456 BANK C 30000 1 30000
9 52 3456 BANK C 35000 2 32500
10 53 3456 BANK C 40000 3 35000
How could I get this output using a faster function? I'm aware of rollmean from zoo as well as SMA from TTR and ma from forecast, but these do not allow for varying width. My question may also be related to this question and this rblog, but I'm not familiar with C++ nor do I know a lot about function writing, so I do not really understand those posts.
EDIT 1: Note that in my code above it isn't the for-loop but the rollapplyr that takes a lot of time.
EDIT 2: The rolling mean shall include no more than the last 4 values. That is, as many consecutive values as there are according to the time variable, but no more than the last 4 values. Sorry for the inexact question! :/ My wording was based on the assumption that I would use the "number.time" column, where it would have been easy to limit all values to a maximum of 4.
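The cap described in EDIT 2 could be sketched in base R as follows, assuming the rows are already sorted by id and then time. The names new_run, run and width are introduced here only for illustration and are not part of the original code:

```r
# Sample data from the question, sorted by id and then time
time <- c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53)
id   <- c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456)

# A new run starts on the first row, when the bank changes, or after a gap in time
new_run <- c(TRUE, diff(id) != 0 | diff(time) != 1)
run <- cumsum(new_run)  # run label for every row

# Position within each run: same values as the loop-based number.time
number.time <- ave(seq_along(run), run, FUN = seq_along)

# Never look back further than 4 values
width <- pmin(number.time, 4)
```

In the sample data no run is longer than 3, so width equals number.time here; the cap only matters for longer runs.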
First create a grouping variable g and then compute the rolling means. Note that rollsum is substantially faster than rollapply but does not support partial, necessitating the workaround shown:
library(zoo) # rollsum
g <- with(df, cumsum(ave(time, id, FUN = function(x) c(1, diff(x) != 1))))
roll4 <- function(x) rollsum(c(0, 0, 0, x), 4) / pmin(4, seq_along(x))
transform(df, avg = ave(assets, g, FUN = roll4))
giving:
time id name assets avg
1 51 1234 BANK A 5000 5000
2 52 1234 BANK A 6000 5500
3 53 1234 BANK A 4000 5000
4 55 1234 BANK A 7000 7000
5 56 1234 BANK A 8000 7500
6 51 2345 BANK B 10000 10000
7 52 2345 BANK B 12000 11000
8 51 3456 BANK C 30000 30000
9 52 3456 BANK C 35000 32500
10 53 3456 BANK C 40000 35000
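To see why the zero padding in roll4 works, the same arithmetic can be reproduced in base R on a single run that is longer than 4 values. This is only an illustrative check, not the answer's code; roll4_base and the vector x are made up for the example:

```r
# Hypothetical helper: mean of the last (at most) 4 values, using cumulative sums
roll4_base <- function(x) {
  cs <- cumsum(x)
  # Cumulative sum 4 positions back; zero while fewer than 4 values are available
  lagged <- c(numeric(4), cs)[seq_along(x)]
  (cs - lagged) / pmin(4, seq_along(x))
}

x <- c(1, 2, 3, 4, 5, 6)  # one run of consecutive times
roll4_base(x)
# 1.0 1.5 2.0 2.5 3.5 4.5
```

The first three results are means over 1, 2 and 3 values because the padded zeros add nothing to the sum and the divisor pmin(4, seq_along(x)) counts only the real values. From the fourth element on, the window is always the last 4 values.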
Use cumsum.
If you have just one bank, try:
cumsum(df$assets)/seq(nrow(df))
What to do if you have more than one bank, I leave as an exercise. Hint: you can completely avoid loops by using rle.
Here is the function "cumsum with restarts", which is supposed to help you.
cumsum.r <- function(vals, restart) {
  if (!is.vector(vals) || !is.vector(restart)) stop("expect vectors")
  if (length(vals) != length(restart)) stop("different length")
  # assume restart = FFTFFFTFFFFT
  len = length(vals) # 12
  restart[1] = TRUE # TFTFFFTFFFFT
  ind = which(restart) # (1,3,7,12)
  ind = rep(ind, c(ind[-1], len + 1) - ind) # 1,1,3,3,3,3,7,7,7,7,7,12
  vals.c = cumsum(vals)
  vals.c - vals.c[ind] + vals[ind]
}
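One possible way to apply this to several banks, sketched under the assumption that the data is sorted by id and then time. The function is repeated in compact form so the example runs on its own; restart and pos are names chosen for this illustration. Note that this yields the mean over the whole run and does not apply the 4-value cap from EDIT 2:

```r
# Compact copy of cumsum.r from above (without the input checks)
cumsum.r <- function(vals, restart) {
  restart[1] <- TRUE
  ind <- which(restart)
  ind <- rep(ind, c(ind[-1], length(vals) + 1) - ind)
  vals.c <- cumsum(vals)
  vals.c - vals.c[ind] + vals[ind]
}

# Sample data from the question
time   <- c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53)
id     <- c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456)
assets <- c(5000, 6000, 4000, 7000, 8000, 10000, 12000, 30000, 35000, 40000)

# Restart whenever the bank changes or the time is not consecutive
restart <- c(TRUE, diff(id) != 0 | diff(time) != 1)

# Position within each run, via rle as hinted above
pos <- sequence(rle(cumsum(restart))$lengths)

cumsum.r(assets, restart) / pos
# 5000 5500 5000 7000 7500 10000 11000 30000 32500 35000
```

For the sample data this matches the desired mov.average column, because no run is longer than 4 values.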