[英]Rolling Mean By Group Dplyr/data.table
我有一個包含 350k 行的大型 data.frame。 目前我正在嘗試在data.table 中創建lag_close 列的滾動平均值。 由於我在dplyr
比在data.table
更流利, dplyr
我嘗試使用dplyr
創建一個滾動平均值,它給我一個錯誤。
df = df %>% na.omit()%>% group_by(ticker) %>% mutate(avg10 = rollapplyr(lag_close, 10, mean))
錯誤:列
avg10
長度必須為 577(組大小)或 1,而不是 567
我的下一個嘗試是走 data.table 路線,因為我看到人們取得了成功。 沒有返回錯誤,但是輸出很奇怪而且不准確。
df = as.data.table(df)
df[, laggedRets := rollapplyr(lag_close, 10, mean, na.rm = T), by = ticker]
tail(df, 10)
ticker ret.adjusted.prices ret.closing.prices lag_close jump laggedRets
1: MELI 0.002731904 0.002731904 501.48 0.020174791 507.849
2: MELI 0.007258614 0.007258614 502.85 0.001966823 507.528
3: MELI -0.002606134 -0.002606134 506.50 -0.011437339 507.085
4: MELI 0.018132218 0.018132218 505.18 0.003832865 506.282
5: MELI -0.020628408 -0.020628408 514.34 -0.014373574 505.262
6: MELI 0.001568257 0.001568257 503.73 0.003428456 NA
7: MELI 0.018155879 0.018155879 504.52 -0.002957652 NA
8: MELI -0.022815765 -0.022815765 513.68 0.002663470 NA
9: MELI -0.007132015 -0.007132015 501.96 -0.005273231 NA
10: MELI -0.012801487 -0.012801487 498.38 -0.016467802 NA
head(df, 10)
ticker ret.adjusted.prices ret.closing.prices lag_close jump laggedRets
1: AA 0.0496011100 0.0496011100 28.83 0.006568736 NA
2: AA 0.0128883014 0.0128883014 30.26 -0.002647255 NA
3: AA 0.0009787928 0.0009787928 30.65 0.001304239 NA
4: AA -0.0391134289 -0.0391134289 30.68 0.006497716 NA
5: AA 0.0508819539 0.0508819539 29.48 0.011802531 31.154
6: AA 0.0319560684 0.0319560684 30.98 0.003222691 31.752
7: AA 0.0334689407 0.0334689407 31.97 0.031709630 32.268
8: AA -0.0009080811 -0.0009080811 33.04 -0.006680893 32.769
9: AA -0.0112086950 -0.0112086950 33.01 -0.001212398 33.330
10: AA 0.0664829064 0.0664829064 32.64 0.009452792 34.132
似乎 data.table 的兩側都有 NA,但事實並非如此。 而且數字也不准確。 我不確定我錯過了什么。 這是一個示例,以便你們可以測試一下。
df =structure(list(ticker = c("AA", "AA", "AA", "AA", "AA", "AA",
"AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA",
"AA", "AA", "AA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA",
"AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA",
"AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AAL", "AAL",
"AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL",
"AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL",
"AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI",
"AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI",
"AAOI", "AAOI", "AAOI", "AAOI", "AAPL", "AAPL", "AAPL", "AAPL",
"AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL",
"AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL",
"ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV",
"ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV",
"ABBV", "ABBV", "ABBV", "ABBV"), lag_close = c(28.83, 30.26,
30.65, 30.68, 29.48, 30.98, 31.969999, 33.040001, 33.009998,
32.639999, 34.810001, 35.419998, 35.66, 36.290001, 37.5, 36.529999,
36.299999, 36.669998, 36.549999, 36.450001, 38.900002, 40.060001,
41.34, 41.23, 41.34, 42.299999, 42.59, 42.110001, 42.27, 41.990002,
42.029999, 42.09, 42.049999, 42.400002, 43.900002, 44.939999,
44.549999, 44.419998, 43.93, 44.07, 46.299999, 46.700001, 45.889999,
46.209999, 47.080002, 48.48, 48.639999, 48.099998, 47.650002,
46.75, 47.639999, 47.259998, 48, 46.939999, 47.540001, 47.91,
49.59, 46.950001, 44.900002, 44.25, 23.1, 23.51, 22.809999, 22.42,
22.24, 22.870001, 22.719999, 29.860001, 28.219999, 30.440001,
30.84, 30.48, 30.58, 31.290001, 32, 31.709999, 31.280001, 31.15,
30.41, 30.76, 116.150002, 116.019997, 116.610001, 117.910004,
118.989998, 119.110001, 119.75, 119.25, 119.040001, 120, 119.989998,
119.779999, 120, 120.080002, 119.970001, 121.879997, 121.940002,
121.949997, 121.629997, 121.349998, 62.41, 63.290001, 63.77,
63.790001, 64.209999, 64.07, 61.139999, 61.279999, 61.990002,
61.860001, 61.66, 61.380001, 61.150002, 60.959999, 60.580002,
61.380001, 61.27, 60, 60.43, 61.110001)), row.names = c(NA, -120L
), class = "data.frame", .Names = c("ticker", "lag_close"))
這是data.table()
方法,第 1-9 行為 NA。
dt <- as.data.table(df)
dt[, laggedRets := frollmean(lag_close, 10, na.rm = T), by = ticker]
此外,尚不清楚是否應在滾動平均值之前刪除NA
。 如果是這種情況,您將使用:
dt <- as.data.table(df)
dt[!is.na(lag_close), laggedRets := frollmean(lag_close, 10), by = ticker]
您可以使用purrr
包中的map
並將其應用於1:n()
:
df = df %>%
na.omit() %>%
group_by(ticker) %>%
mutate(avg10 = map_dbl(1:n(), ~mean(lag_close[(max(.x-9, 1)):.x], na.rm =T))
當然,您必須決定對少於 10 個觀察值的前 9 行應該發生什么。 在我的解決方案中,第 1 行到第 9 行包含最后 1 到 9 個觀察值的平均值。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.