I have a dataframe that contains bank assets for several dates (times). Each bank has a unique ID:
# Sample Data
time <- c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53)
id <- c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456)
name <- c("BANK A", "BANK A", "BANK A", "BANK A", "BANK A", "BANK B", "BANK B", "BANK C",
"BANK C", "BANK C")
assets <- c(5000, 6000, 4000, 7000, 8000, 10000, 12000, 30000, 35000, 40000)
df <- data.frame(time, id, name, assets)
> df
time id name assets
1 51 1234 BANK A 5000
2 52 1234 BANK A 6000
3 53 1234 BANK A 4000
4 55 1234 BANK A 7000
5 56 1234 BANK A 8000
6 51 2345 BANK B 10000
7 52 2345 BANK B 12000
8 51 3456 BANK C 30000
9 52 3456 BANK C 35000
10 53 3456 BANK C 40000
For each bank I want to calculate the rolling mean of the assets, varying the width according to the number of consecutive time values. So the rolling mean shall include all available consecutive previous values of a bank's assets. If there is no previous value available for a bank, it shall equal assets. Therefore I add a column that counts the number of consecutive time values and then use rollapplyr from the zoo package, which gives me the desired result, but with a large data set it is far too slow:
# Calculate number of consecutive times
require(dplyr)
df <- df %>%
mutate(number.time = 1) %>% # insert column for number.time, start value = 1
group_by(id) %>%
arrange(id, time) # sort by bank, then time: correct order for moving average
for(i in 2:nrow(df)) # Start loop in second row, end in last row of df
df$number.time[i] <-
ifelse(df$id[i] == df$id[i-1] && df$time[i] == df$time[i-1]+1, # Same bank and consecutive time?
df$number.time[i - 1] + 1, # If yes: add 1 to previous number.time
1) # If no: set number.time = 1
# Moving Average
require(zoo)
df %>%
mutate(mov.average = rollapplyr(data = assets,
width = number.time, # use number.time for width
FUN = mean,
fill = NA,
na.rm = TRUE))
Source: local data frame [10 x 6]
Groups: id [3]
time id name assets number.time mov.average
(dbl) (dbl) (fctr) (dbl) (dbl) (dbl)
1 51 1234 BANK A 5000 1 5000
2 52 1234 BANK A 6000 2 5500
3 53 1234 BANK A 4000 3 5000
4 55 1234 BANK A 7000 1 7000
5 56 1234 BANK A 8000 2 7500
6 51 2345 BANK B 10000 1 10000
7 52 2345 BANK B 12000 2 11000
8 51 3456 BANK C 30000 1 30000
9 52 3456 BANK C 35000 2 32500
10 53 3456 BANK C 40000 3 35000
How could I get this output using a faster function? I'm aware of rollmean from zoo as well as SMA from TTR and ma from forecast, but these do not allow for varying width. My question may also be related to this question and this rblog, but I'm not familiar with C++, nor do I know a lot about writing functions, so I do not really understand those posts.
EDIT 1: Note that in my code above it isn't the for-loop but rollapplyr that takes a lot of time.
EDIT 2: The rolling mean shall include no more than the last 4 values. That is, as many consecutive values as there are according to the time variable, but at most the last 4. Sorry for the inexact question! :/ My wording was based on the assumption of using the "number.time" column, where it would have been easy to limit all values to a maximum of 4.
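To pin down what EDIT 2 asks for, here is a slow but explicit base-R sketch of the target result. It assumes the number.time values from the code above are already available as a vector, and the names width and mov.average are only illustrative. It is meant as a reference to check faster solutions against, not as a fast solution itself.

```r
# Sample data, in the same row order as the output above
assets      <- c(5000, 6000, 4000, 7000, 8000, 10000, 12000, 30000, 35000, 40000)
number.time <- c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3)  # consecutive-time counter from the question

# Cap the window at 4 values, as required by EDIT 2
width <- pmin(number.time, 4)

# For each row, average the last width[i] values ending at row i
mov.average <- vapply(
  seq_along(assets),
  function(i) mean(assets[(i - width[i] + 1):i]),
  numeric(1)
)
mov.average
```

With the sample data no run is longer than 3, so the cap never takes effect and the result equals the mov.average column shown above.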
First create a grouping variable g and then compute the rolling means. Note that rollsum is substantially faster than rollapply but does not support partial, necessitating the workaround shown:
library(zoo) # rollsum
g <- with(df, cumsum(ave(time, id, FUN = function(x) c(1, diff(x) != 1))))
roll4 <- function(x) rollsum(c(0, 0, 0, x), 4) / pmin(4, seq_along(x))
transform(df, avg = ave(assets, g, FUN = roll4))
giving:
time id name assets avg
1 51 1234 BANK A 5000 5000
2 52 1234 BANK A 6000 5500
3 53 1234 BANK A 4000 5000
4 55 1234 BANK A 7000 7000
5 56 1234 BANK A 8000 7500
6 51 2345 BANK B 10000 10000
7 52 2345 BANK B 12000 11000
8 51 3456 BANK C 30000 30000
9 52 3456 BANK C 35000 32500
10 53 3456 BANK C 40000 35000
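To see what the two steps do, the following sketch prints the grouping variable and mimics the zero-padding trick in base R. Note the assumption: rowSums(embed(...)) is used here only as a stand-in for zoo's rollsum so that the snippet runs without packages, and roll4.base is a hypothetical name. It is not claimed to be as fast as rollsum.

```r
time   <- c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53)
id     <- c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456)
assets <- c(5000, 6000, 4000, 7000, 8000, 10000, 12000, 30000, 35000, 40000)

# Within each bank, flag the first row and every row that follows a gap in time;
# the cumulative sum of those flags numbers the runs of consecutive times.
g <- cumsum(ave(time, id, FUN = function(x) c(1, diff(x) != 1)))
g

# Padding with three zeros gives every position a full window of 4,
# so the window sum at position i covers min(i, 4) real values.
# Dividing by pmin(4, seq_along(x)) turns those sums into means.
roll4.base <- function(x) {
  rowSums(embed(c(0, 0, 0, x), 4)) / pmin(4, seq_along(x))
}

avg <- ave(assets, g, FUN = roll4.base)
avg
```

On a single run of six values, roll4.base(1:6) averages 1, 2, 3 and then 4 values, which is where the cap of 4 becomes visible.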
Use cumsum.
If you have just one bank, try:
cumsum(df$assets)/seq(nrow(df))
What to do if you have more than one bank, I leave as an exercise. Hint: you can completely avoid loops by using rle.
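One possible reading of the rle hint, as a sketch: number the runs of consecutive times, then let rle report how long each run is. The names new.run, run.id and pos are illustrative, and the rows are assumed to be sorted by id and then time.

```r
time <- c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53)
id   <- c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456)

# A new run starts at the first row, when the bank changes, or after a gap in time
new.run <- c(TRUE, diff(id) != 0 | diff(time) != 1)
run.id  <- cumsum(new.run)

# rle gives the length of each run without any explicit loop
runs <- rle(run.id)
runs$lengths

# Position of each row inside its run (the divisor for a per-run cumulative mean)
pos <- sequence(runs$lengths)
pos
```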
Here is the function "cumsum with restarts" which is supposed to help you.
cumsum.r <- function(vals, restart) {
if (!is.vector(vals) || !is.vector(restart)) stop("expect vectors")
if (length(vals) != length(restart)) stop("different length")
# assume restart = FFTFFFTFFFFT
len = length(vals) # 12
restart[1] = TRUE # TFTFFFTFFFFT
ind = which(restart) # (1,3,7,12)
ind = rep(ind, c(ind[-1],len+1)-ind) # 1,1,3,3,3,3,7,7,7,7,7,12
vals.c = cumsum(vals)
vals.c - vals.c[ind] + vals[ind]
}
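A hedged sketch of how this function might be applied to the sample data, including the cap of 4 values from EDIT 2. The wrapper run.mean and its argument k are hypothetical names, and a condensed copy of cumsum.r (without the input checks) is repeated only so that the snippet runs on its own. Rows are assumed to be sorted by id and then time.

```r
# Condensed copy of cumsum.r from above, input checks omitted
cumsum.r <- function(vals, restart) {
  restart[1] <- TRUE
  ind <- which(restart)
  ind <- rep(ind, c(ind[-1], length(vals) + 1) - ind)
  vals.c <- cumsum(vals)
  vals.c - vals.c[ind] + vals[ind]
}

# Mean over the current run, using at most the last k values
run.mean <- function(vals, restart, k = 4) {
  run.sum <- cumsum.r(vals, restart)                     # sum since the run started
  run.len <- cumsum.r(rep(1, length(vals)), restart)     # position within the run
  idx <- seq_along(vals)
  # Running sum as it stood k rows earlier; 0 while the run has at most k values
  old <- ifelse(run.len > k, run.sum[pmax(idx - k, 1)], 0)
  (run.sum - old) / pmin(run.len, k)
}

time   <- c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53)
id     <- c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456)
assets <- c(5000, 6000, 4000, 7000, 8000, 10000, 12000, 30000, 35000, 40000)

# Restart at the first row, when the bank changes, or after a gap in time
restart <- c(TRUE, diff(id) != 0 | diff(time) != 1)
avg <- run.mean(assets, restart)
avg
```

Everything here is vectorized, so the cost grows linearly with the number of rows. When run.len exceeds k, row i - k is guaranteed to lie in the same run, which is why subtracting its running sum leaves exactly the last k values.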