R - Fast way to calculate rolling mean with varying width

I have a data frame that contains bank assets for several dates (times). Each bank has a unique ID:

# Sample Data
time <- c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53)
id <- c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456)
name <- c("BANK A", "BANK A", "BANK A", "BANK A", "BANK A", "BANK B", "BANK B", "BANK C", 
          "BANK C", "BANK C")
assets <- c(5000, 6000, 4000, 7000, 8000, 10000, 12000, 30000, 35000, 40000)
df <- data.frame(time, id, name, assets)

> df
   time   id   name assets
1    51 1234 BANK A   5000
2    52 1234 BANK A   6000
3    53 1234 BANK A   4000
4    55 1234 BANK A   7000
5    56 1234 BANK A   8000
6    51 2345 BANK B  10000
7    52 2345 BANK B  12000
8    51 3456 BANK C  30000
9    52 3456 BANK C  35000
10   53 3456 BANK C  40000

For each bank I want to calculate the rolling mean of the assets, varying the width according to the number of consecutive time values. So the rolling mean shall include all available consecutive previous values of a bank's assets. If there is no previous value available for a bank, it shall equal assets. Therefore I add a column that counts the number of consecutive time values and then use rollapplyr from the zoo package, which gives me the desired result, but with a large data set it is far too slow:

# Calculate number of consecutive times
require(dplyr)
df <- df %>%
  mutate(number.time = 1) %>% # insert column for number.time, start value = 1
  group_by(id) %>%
  arrange(id, time) # correct order for moving average: by bank, then by time

for(i in 2:nrow(df)) # Start loop in second row, end in last row of df
  df$number.time[i] <- 
    ifelse(df$id[i] == df$id[i-1] & df$time[i] == df$time[i-1]+1, # Same bank and consecutive time?
           df$number.time[i - 1] + 1,       # If yes: add 1 to previous number.time
           1)                               # If no: set number.time = 1
# Moving Average
require(zoo)
df %>%
  mutate(mov.average = rollapplyr(data = assets,
                                  width = number.time, # use number.time for width
                                  FUN = mean, 
                                  fill = NA,
                                  na.rm = TRUE))
Source: local data frame [10 x 6]
Groups: id [3]

    time    id   name assets number.time mov.average
   (dbl) (dbl) (fctr)  (dbl)       (dbl)       (dbl)
1     51  1234 BANK A   5000           1        5000
2     52  1234 BANK A   6000           2        5500
3     53  1234 BANK A   4000           3        5000
4     55  1234 BANK A   7000           1        7000
5     56  1234 BANK A   8000           2        7500
6     51  2345 BANK B  10000           1       10000
7     52  2345 BANK B  12000           2       11000
8     51  3456 BANK C  30000           1       30000
9     52  3456 BANK C  35000           2       32500
10    53  3456 BANK C  40000           3       35000

How could I get this output using a faster function? I'm aware of rollmean from zoo as well as SMA from TTR and ma from forecast, but these do not allow for varying width. My question may also be related to this question and this rblog, but I'm not familiar with C++, nor do I know a lot about writing functions, so I do not really understand those posts.

EDIT 1: Note that in my code above it isn't the for loop but the rollapplyr call that takes a lot of time.

EDIT 2: The rolling mean shall include no more than the last 4 values. That is, as many consecutive values as there are according to the time variable, but no more than the last 4 values. Sorry for the inexact question! :/ My wording was based on the assumption of using the "number.time" column, where it would have been easy to limit all values to a maximum of 4.

First create a grouping variable g and then compute the rolling means. Note that rollsum is substantially faster than rollapply but does not support partial, necessitating the workaround shown (pad each run with three leading zeros, then divide by the actual window size):

library(zoo) # rollsum

g <- with(df, cumsum(ave(time, id, FUN = function(x) c(1, diff(x) != 1))))
roll4 <- function(x) rollsum(c(0, 0, 0, x), 4) / pmin(4, seq_along(x)) 
transform(df, avg = ave(assets, g, FUN = roll4))

giving:

   time   id   name assets   avg
1    51 1234 BANK A   5000  5000
2    52 1234 BANK A   6000  5500
3    53 1234 BANK A   4000  5000
4    55 1234 BANK A   7000  7000
5    56 1234 BANK A   8000  7500
6    51 2345 BANK B  10000 10000
7    52 2345 BANK B  12000 11000
8    51 3456 BANK C  30000 30000
9    52 3456 BANK C  35000 32500
10   53 3456 BANK C  40000 35000
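
To make the two steps easier to follow, here is a small base-R sketch that rebuilds the sample data, prints the run id g, and computes the same capped mean from differences of a cumulative sum instead of rollsum. The helper name roll4_base is made up for this illustration and is only assumed to be equivalent to roll4 above; it is not part of zoo.

```r
# Sample data from the question (rows sorted by id, then time)
df <- data.frame(
  time   = c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53),
  id     = c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456),
  assets = c(5000, 6000, 4000, 7000, 8000, 10000, 12000, 30000, 35000, 40000)
)

# Run id: a new run starts at the first row of each bank and after every gap in time
g <- with(df, cumsum(ave(time, id, FUN = function(x) c(1, diff(x) != 1))))
print(g)  # 1 1 1 2 2 3 3 4 4 4

# Hypothetical base-R stand-in for roll4(): sum of the last (up to) 4 values,
# taken as a difference of a zero-padded cumulative sum, divided by the window size
roll4_base <- function(x) {
  n  <- length(x)
  i  <- seq_len(n)
  cs <- c(0, cumsum(x))   # cs[i + 1] is the sum of x[1..i]
  k  <- pmin(4, i)        # window size per position, capped at 4
  (cs[i + 1] - cs[i - k + 1]) / k
}

avg <- ave(df$assets, g, FUN = roll4_base)
print(avg)
```

On the sample data this reproduces the avg column shown above. Every run there is shorter than 4 rows, so the cap only matters for longer runs, for example roll4_base(1:6).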

Use cumsum.

If you have just one bank, try:

cumsum(df$assets)/seq(nrow(df))

What to do if you have more than one bank, I leave as an exercise. Hint: you can completely avoid loops by using rle.
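
As a rough illustration of the hint, rle() combined with sequence() can number the rows within each run of consecutive times without a loop. The sketch assumes a single bank whose times are integers sorted in strictly increasing order; the variable names are illustrative.

```r
time <- c(51, 52, 53, 55, 56)   # bank A from the sample data

# Within a run of consecutive times, time minus row position is constant;
# every gap increases it, so rle() on this key finds the run lengths
run_key <- time - seq_along(time)
number.time <- sequence(rle(run_key)$lengths)
print(number.time)  # 1 2 3 1 2

# Width to use when at most the last 4 values may enter the mean (EDIT 2)
width <- pmin(number.time, 4)
```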

Here is the function "cumsum with restarts", which is supposed to help you.

cumsum.r <- function(vals, restart) {
    if (!is.vector(vals) || !is.vector(restart)) stop("expect vectors")
    if (length(vals) != length(restart)) stop("different length")
    # assume restart = FFTFFFTFFFFT
    len <- length(vals) # 12
    restart[1] <- TRUE # TFTFFFTFFFFT
    ind <- which(restart) # (1,3,7,12)
    ind <- rep(ind, c(ind[-1], len + 1) - ind) # 1,1,3,3,3,3,7,7,7,7,7,12
    vals.c <- cumsum(vals)
    vals.c - vals.c[ind] + vals[ind] # running sum since the most recent restart
}
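
One possible way to finish the exercise is sketched below. It restates the restart idea in a compact helper (called cumsum_restart here, a name chosen only for this illustration), flags a restart wherever the bank changes or the time is not consecutive, and divides the restarted sums by the position within the run. The cap of 4 values from EDIT 2 is handled by subtracting the restarted sum from four rows earlier; that step is an addition of this sketch, not part of the function above. The rows are assumed to be sorted by id and then time.

```r
df <- data.frame(
  time   = c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53),
  id     = c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456),
  assets = c(5000, 6000, 4000, 7000, 8000, 10000, 12000, 30000, 35000, 40000)
)

# Compact restatement of "cumsum with restarts" (hypothetical helper name)
cumsum_restart <- function(vals, restart) {
  restart[1] <- TRUE
  start <- which(restart)
  start <- rep(start, diff(c(start, length(vals) + 1)))  # run start index for each row
  cs <- cumsum(vals)
  cs - cs[start] + vals[start]
}

n <- nrow(df)
# A run restarts when the bank changes or the time is not the previous time + 1
restart <- c(TRUE, df$id[-1] != df$id[-n] | df$time[-1] != df$time[-n] + 1)

run_sum <- cumsum_restart(df$assets, restart)  # running sum within each run
run_pos <- cumsum_restart(rep(1, n), restart)  # 1, 2, 3, ... within each run

# Cap the window at k = 4: remove everything older than the last 4 rows of the run
k <- 4
older <- ifelse(run_pos > k, run_sum[pmax(seq_len(n) - k, 1)], 0)
df$avg <- (run_sum - older) / pmin(run_pos, k)
print(df)
```

Everything here is vectorised, so no loop over rows or groups is needed. On the sample data the avg column matches the desired output from the question.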
