
Apply a rolling mean to a database by an index

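I would like to calculate rolling means of the data in a single data frame, separately for each of several ids. See the example data set below.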

date <- as.Date(c("2015-02-01", "2015-02-02", "2015-02-03", "2015-02-04", 
          "2015-02-05", "2015-02-06", "2015-02-07", "2015-02-08",  
          "2015-02-09", "2015-02-10", "2015-02-01", "2015-02-02", 
          "2015-02-03", "2015-02-04", "2015-02-05", "2015-02-06", 
          "2015-02-07", "2015-02-08", "2015-02-09", "2015-02-10"))
index <- c("a","a","a","a","a","a","a","a","a","a",
           "b","b","b","b","b","b","b","b","b","b")
x <- runif(20,1,100)
y <- runif(20,50,150)
z <- runif(20,100,200)

df <- data.frame(date, index, x, y, z)

I would like to compute the rolling mean of x, y and z for index a, and then for index b.

I tried the following, but I get an error.

test <- tapply(df, df$index, FUN = rollmean(df, 5, fill=NA))

Error:

Error in xu[k:n] - xu[c(1, seq_len(n - k))] : 
  non-numeric argument to binary operator

It looks like the problem is that index is a character column, but I need it in order to compute the means by group...

1) ave. Try ave rather than tapply, and make sure it is applied only to the columns of interest, i.e. columns 3, 4 and 5.

library(zoo)  # provides rollmean
# centred 5-point rolling mean, padded with NA at both ends
roll <- function(x) rollmean(x, 5, fill = NA)
cbind(df[1:2], lapply(df[3:5], function(x) ave(x, df$index, FUN = roll)))

giving:

         date index        x         y        z
1  2015-02-01     a       NA        NA       NA
2  2015-02-02     a       NA        NA       NA
3  2015-02-03     a 66.50522 127.45650 129.8472
4  2015-02-04     a 61.71320 123.83633 129.7673
5  2015-02-05     a 56.56125 120.86158 126.1371
6  2015-02-06     a 66.13340 119.93428 127.1819
7  2015-02-07     a 59.56807 105.83208 125.1244
8  2015-02-08     a 49.98779  95.66024 139.2321
9  2015-02-09     a       NA        NA       NA
10 2015-02-10     a       NA        NA       NA
11 2015-02-01     b       NA        NA       NA
12 2015-02-02     b       NA        NA       NA
13 2015-02-03     b 55.71327 117.52219 139.3961
14 2015-02-04     b 54.58450 107.81763 142.6101
15 2015-02-05     b 50.48102 104.94084 136.3167
16 2015-02-06     b 37.89790  95.45489 135.4044
17 2015-02-07     b 33.05259  85.90916 150.8673
18 2015-02-08     b 49.91385  90.04940 147.1376
19 2015-02-09     b       NA        NA       NA
20 2015-02-10     b       NA        NA       NA
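
As a rough illustration of what ave is doing here, the sketch below splits one column by group, applies roll to each piece, and puts the pieces back in the original row order. It assumes the zoo package is installed; the names via_ave and via_split are made up for this example and are not part of the answer.

```r
library(zoo)

set.seed(123)
index <- rep(c("a", "b"), each = 10)
x <- runif(20, 1, 100)

# same helper as in the answer: centred 5-point mean, NA-padded
roll <- function(v) rollmean(v, 5, fill = NA)

# ave applies roll within each level of index and keeps row order
via_ave <- ave(x, index, FUN = roll)

# the same thing spelled out: split, apply per group, unsplit
via_split <- unsplit(lapply(split(x, index), roll), index)

stopifnot(isTRUE(all.equal(via_ave, via_split)))
```

The first non-NA value in each group is the mean of that group's first five observations, which is why rows 1, 2, 9 and 10 of every group are NA.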

2) by. Another way is to use by. roll2 handles a single group, by applies it to each group and produces a by list, and do.call("rbind", ...) puts the pieces back together.

roll2 <- function(x) cbind(x[1:2], rollmean(x[3:5], 5, fill = NA))
do.call("rbind", by(df, df$index, roll2))

giving:

           date index        x         y        z
a.1  2015-02-01     a       NA        NA       NA
a.2  2015-02-02     a       NA        NA       NA
a.3  2015-02-03     a 66.50522 127.45650 129.8472
a.4  2015-02-04     a 61.71320 123.83633 129.7673
a.5  2015-02-05     a 56.56125 120.86158 126.1371
a.6  2015-02-06     a 66.13340 119.93428 127.1819
a.7  2015-02-07     a 59.56807 105.83208 125.1244
a.8  2015-02-08     a 49.98779  95.66024 139.2321
a.9  2015-02-09     a       NA        NA       NA
a.10 2015-02-10     a       NA        NA       NA
b.11 2015-02-01     b       NA        NA       NA
b.12 2015-02-02     b       NA        NA       NA
b.13 2015-02-03     b 55.71327 117.52219 139.3961
b.14 2015-02-04     b 54.58450 107.81763 142.6101
b.15 2015-02-05     b 50.48102 104.94084 136.3167
b.16 2015-02-06     b 37.89790  95.45489 135.4044
b.17 2015-02-07     b 33.05259  85.90916 150.8673
b.18 2015-02-08     b 49.91385  90.04940 147.1376
b.19 2015-02-09     b       NA        NA       NA
b.20 2015-02-10     b       NA        NA       NA

3) wide form. Another approach is to convert df from long form to wide form, in which case a plain rollmean will do the job.

rollmean(read.zoo(df, split = 2), 5, fill = NA)

giving:

                x.a       y.a      z.a      x.b       y.b      z.b
2015-02-01       NA        NA       NA       NA        NA       NA
2015-02-02       NA        NA       NA       NA        NA       NA
2015-02-03 66.50522 127.45650 129.8472 55.71327 117.52219 139.3961
2015-02-04 61.71320 123.83633 129.7673 54.58450 107.81763 142.6101
2015-02-05 56.56125 120.86158 126.1371 50.48102 104.94084 136.3167
2015-02-06 66.13340 119.93428 127.1819 37.89790  95.45489 135.4044
2015-02-07 59.56807 105.83208 125.1244 33.05259  85.90916 150.8673
2015-02-08 49.98779  95.66024 139.2321 49.91385  90.04940 147.1376
2015-02-09       NA        NA       NA       NA        NA       NA
2015-02-10       NA        NA       NA       NA        NA       NA

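This works because the dates are the same for both groups. If the dates were different, the conversion to wide form could introduce NAs, and rollmean cannot handle those. In that case use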

rollapply(read.zoo(df, split = 2), 5, mean, fill = NA)
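
To make the NA issue concrete, here is a small sketch with hypothetical data (df2, not from the question) in which group b is shifted by one day relative to group a. It assumes the zoo package; passing na.rm = TRUE through to mean is an optional variation and not something the answer itself proposes.

```r
library(zoo)

# Hypothetical data: group "b" starts and ends one day later than group "a"
df2 <- data.frame(
  date  = as.Date("2015-02-01") + c(0:5, 1:6),
  index = rep(c("a", "b"), each = 6),
  x     = c(1:6, 11:16)
)

# Going wide aligns both groups on the union of dates,
# so each group gets an NA on the day it has no observation
wide <- read.zoo(df2, split = 2)
n_missing <- sum(is.na(coredata(wide)))

# mean() returns NA for any window that contains an NA
strict <- rollapply(wide, 3, mean, fill = NA)

# extra arguments are passed on to mean(), so NAs inside a window are skipped
lenient <- rollapply(wide, 3, mean, na.rm = TRUE, fill = NA)
```

Whether a window that is missing an observation should produce NA or a mean of the remaining values depends on the analysis, so the choice between the two calls is a judgment call.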

Note: Since the input uses random numbers in its definition, we must first call set.seed to make it reproducible. We used this:

set.seed(123)
date <- as.Date(c("2015-02-01", "2015-02-02", "2015-02-03", "2015-02-04", 
          "2015-02-05", "2015-02-06", "2015-02-07", "2015-02-08",  
          "2015-02-09", "2015-02-10", "2015-02-01", "2015-02-02", 
          "2015-02-03", "2015-02-04", "2015-02-05", "2015-02-06", 
          "2015-02-07", "2015-02-08", "2015-02-09", "2015-02-10"))
index <- c("a","a","a","a","a","a","a","a","a","a",
           "b","b","b","b","b","b","b","b","b","b")
x <- runif(20,1,100)
y <- runif(20,50,150)
z <- runif(20,100,200)
df <- data.frame(date, index, x, y, z)

This should do the trick using the libraries dplyr and zoo:

library(dplyr)
library(zoo)

df %>% 
  group_by(index) %>% 
  mutate(x_mean = rollmean(x, 5, fill = NA),
         y_mean = rollmean(y, 5, fill = NA),
         z_mean = rollmean(z, 5, fill = NA))

You can probably tidy this up further using mutate_each or some other form of mutate.
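
One possible tidier form, as a sketch: in dplyr 1.0.0 and later, mutate_each is deprecated and across() covers the same ground. The example rebuilds an equivalent data frame so that it runs on its own; the name res and the "_mean" suffix are arbitrary choices for this illustration.

```r
library(dplyr)
library(zoo)

set.seed(123)
df <- data.frame(
  date  = rep(seq(as.Date("2015-02-01"), by = "day", length.out = 10), 2),
  index = rep(c("a", "b"), each = 10),
  x = runif(20, 1, 100),
  y = runif(20, 50, 150),
  z = runif(20, 100, 200)
)

res <- df %>%
  group_by(index) %>%
  # apply the same rolling mean to x, y and z; new columns get a "_mean" suffix
  mutate(across(c(x, y, z),
                ~ rollmean(.x, 5, fill = NA),
                .names = "{.col}_mean")) %>%
  ungroup()
```

This keeps the original x, y and z columns and adds x_mean, y_mean and z_mean next to them, computed within each index group.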

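You can also change the arguments of rollmean to fit your needs, for example align = "right" or na.pad = TRUE (current zoo documentation marks na.pad as deprecated in favour of fill = NA).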
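
A minimal sketch of what align changes, using a made-up vector 1:10 rather than the question's data. With the default centred alignment each mean is placed at the middle of its window; with align = "right" it is placed at the window's last element, giving a trailing mean.

```r
library(zoo)

x <- 1:10

# default align = "center": two NAs at each end
centred <- rollmean(x, 5, fill = NA)
# NA NA  3  4  5  6  7  8 NA NA

# align = "right": each value is the mean of the current and the four previous points
trailing <- rollmean(x, 5, fill = NA, align = "right")
# NA NA NA NA  3  4  5  6  7  8
```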
