data.table grouping across multiple sequential factors
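I am trying to estimate some parameters over n factors of a data.table. I am familiar with using `by` to perform an operation per group, but doing this across multiple sequential factors is causing some problems.
For example, using a simplified data set: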
df <- data.table(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)), Variable = round(rnorm(16), 2))
Group Variable
1: A 0.13
2: A 0.26
3: B -1.36
4: B -0.78
5: B -0.92
6: C 0.00
7: C -2.49
8: D -1.85
9: D 0.37
10: D -0.57
11: D 1.42
12: E -0.72
13: F -1.04
14: F 1.86
15: F 0.49
16: F 1.61
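Using df[, mean(Variable), by = Group] gives the mean for each Group. However, I would like to compute the mean over the previous n groups.
I have tried df[, zoo::rollapply(Variable, n, mean), by = Group], but that does not work: a fixed n counts rows, and the groups have different sizes.
The functionality I am after is something like df[, mean(Variable), by = "This Group and previous n Groups"].
The output I want to achieve (for the case n = 3) would look like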
Group Variable
1: A NA
2: A NA
3: B NA
4: B NA
5: B NA
6: C 0.13
7: C 0.13
8: D -1.36
9: D -1.36
10: D -1.36
11: D -1.36
12: E 0
13: F -1.85
14: F -1.85
15: F -1.85
16: F -1.85
Any help would be greatly appreciated.
library(data.table)
library(RcppRoll)
df1 <- df[, .(n=.N, S=sum(Variable)), by = Group]
df1[, NewVariable:=roll_sum(S, 3, align="right", fill=NA)/roll_sum(n, 3, align="right", fill=NA),]
df[df1, on="Group"]
Group Variable n S NewVariable
1: A -0.63 2 -0.45 NA
2: A 0.18 2 -0.45 NA
3: B -0.84 3 1.09 NA
4: B 1.60 3 1.09 NA
5: B 0.33 3 1.09 NA
6: C -0.82 2 -0.33 0.04428571
7: C 0.49 2 -0.33 0.04428571
8: D 0.74 4 2.52 0.36444444
9: D 0.58 4 2.52 0.36444444
10: D -0.31 4 2.52 0.36444444
11: D 1.51 4 2.52 0.36444444
12: E 0.39 1 0.39 0.36857143
13: F -0.62 4 -1.75 0.12888889
14: F -2.21 4 -1.75 0.12888889
15: F 1.12 4 -1.75 0.12888889
16: F -0.04 4 -1.75 0.12888889
I hope my solution is self-explanatory.
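The reasoning behind this approach is that the mean over all rows in a window of groups equals the windowed sum of the per-group sums divided by the windowed sum of the per-group counts. Below is a minimal sketch that checks this for a single window, using only data.table and the values printed above. The names `win`, `pooled`, and `from_totals` are illustrative and not part of the original answer.

```r
library(data.table)

# Values copied from the printed output above (the set.seed(1) data).
df <- data.table(
  Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)),
  Variable = c(-0.63, 0.18, -0.84, 1.60, 0.33, -0.82, 0.49, 0.74,
               0.58, -0.31, 1.51, 0.39, -0.62, -2.21, 1.12, -0.04)
)

# Per-group row counts and sums, as in df1 above.
df1 <- df[, .(n = .N, S = sum(Variable)), by = Group]

# Window ending at group "C": groups A, B and C.
win <- c("A", "B", "C")

# Mean taken directly over every row in the window ...
pooled <- df[Group %in% win, mean(Variable)]

# ... equals (sum of group sums) / (sum of group counts).
from_totals <- df1[Group %in% win, sum(S) / sum(n)]

all.equal(pooled, from_totals)
```

Both values should agree with the NewVariable entry shown for group C (0.04428571), which is why rolling over the per-group totals avoids the problem of unequal group sizes.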
The dplyr equivalent is
df %>%
group_by(Group) %>%
summarise(n=n(), S=sum(Variable)) %>%
mutate(NewVar=roll_sum(S, 3, align="right", fill=NA)/roll_sum(n, 3, align="right", fill=NA)) %>%
left_join(df, by="Group")
Data
set.seed(1)
df <- data.table(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)), Variable = round(rnorm(16), 2))
Package info
[1] RcppRoll_0.2.2 data.table_1.9.5
This may not be the most efficient way, but it works:
First, let's set the seed for reproducibility:
set.seed(1038)
> df
Group Variable
1: A -0.86
2: A 0.57
3: B 0.10
4: B -1.57
5: B 1.73
6: C -0.56
7: C 0.54
8: D -1.71
9: D -0.47
10: D -1.00
11: D 1.03
12: E -0.47
13: F -1.06
14: F -2.06
15: F -0.57
16: F 1.70
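Now convert Group to an integer (grp_no) to make "n - 1 groups back" more tangible, then collapse the multiple observations within each grp_no into a single row: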
setkey(df[ , grp_no := as.integer(as.factor(Group))], grp_no)
df_ttls <- df[ , .(ttl = sum(Variable), .N), by = grp_no]
> df_ttls
grp_no ttl N
1: 1 -0.29 2
2: 2 0.26 3
3: 3 -0.02 2
4: 4 -2.15 4
5: 5 -0.47 1
6: 6 -1.99 4
Now build the weighted average using shift:
df_ttls[ , lag3avg := rowSums(sapply(0:2, shift, x = ttl))/
rowSums(sapply(0:2, shift, x = N))]
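To make that step easier to follow, here is a small sketch of what the `sapply(0:2, shift, ...)` calls build: one column per lag, which `rowSums()` then adds up. The totals and counts are copied from the df_ttls printout above. The names `lagged_ttl` and `lagged_N` are illustrative only.

```r
library(data.table)

# Group totals and counts copied from the df_ttls printout above.
ttl <- c(-0.29, 0.26, -0.02, -2.15, -0.47, -1.99)
N   <- c(2, 3, 2, 4, 1, 4)

# One column per lag: 0 = the current group, 1 and 2 = the two previous groups.
# Rows without enough history contain NA in the lagged columns.
lagged_ttl <- sapply(0:2, function(k) shift(ttl, n = k))
lagged_N   <- sapply(0:2, function(k) shift(N, n = k))

# rowSums() propagates NA, so the first two groups end up as NA.
lag3avg <- rowSums(lagged_ttl) / rowSums(lagged_N)
lag3avg
```

Dividing the summed totals by the summed counts weights every observation equally, rather than averaging the three group means.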
And merge it back into the full data set:
df[df_ttls, lag3avg := i.lag3avg][ ]
Group Variable grp_no lag3avg
1: A -0.86 1 NA
2: A 0.57 1 NA
3: B 0.10 2 NA
4: B -1.57 2 NA
5: B 1.73 2 NA
6: C -0.56 3 -0.007142857
7: C 0.54 3 -0.007142857
8: D -1.71 4 -0.212222222
9: D -0.47 4 -0.212222222
10: D -1.00 4 -0.212222222
11: D 1.03 4 -0.212222222
12: E -0.47 5 -0.377142857
13: F -1.06 6 -0.512222222
14: F -2.06 6 -0.512222222
15: F -0.57 6 -0.512222222
16: F 1.70 6 -0.512222222
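The merge step above relies on df being keyed by grp_no. As a hedged illustration of the same update join with the join column spelled out through `on =`, here is a sketch on two toy tables. The tables `rows` and `per_group` and their values are made up for this example.

```r
library(data.table)

# Toy tables; the values are illustrative only.
rows <- data.table(grp_no = c(1L, 1L, 2L, 3L, 3L),
                   Variable = c(0.5, -0.5, 1, 2, 4))
per_group <- data.table(grp_no = 1:3, lag_avg = c(NA, 0.25, 1.4))

# Update join: for each row of `rows`, look up the matching row of
# `per_group` by grp_no and add its lag_avg by reference. The i. prefix
# refers to columns of the table in the i position (per_group).
rows[per_group, on = "grp_no", lag_avg := i.lag_avg]
rows[]
```

Naming the join column explicitly means the join does not depend on whether a key happens to be set on the table.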
Note that this can easily be extended into a function:
k_lag_avg <- function(k){
df[df_ttls[ , .(grp_no, rowSums(sapply(1:k - 1L, shift, x = ttl))/
rowSums(sapply(1:k -1L, shift, x = N)))],
paste0("lag", k, "avg") := i.V2]
}
k_lag_avg(5L); df[ ]
Group Variable grp_no lag3avg lag5avg
1: A -0.86 1 NA NA
2: A 0.57 1 NA NA
3: B 0.10 2 NA NA
4: B -1.57 2 NA NA
5: B 1.73 2 NA NA
6: C -0.56 3 -0.007142857 NA
7: C 0.54 3 -0.007142857 NA
8: D -1.71 4 -0.212222222 NA
9: D -0.47 4 -0.212222222 NA
10: D -1.00 4 -0.212222222 NA
11: D 1.03 4 -0.212222222 NA
12: E -0.47 5 -0.377142857 -0.2225000
13: F -1.06 6 -0.512222222 -0.3121429
14: F -2.06 6 -0.512222222 -0.3121429
15: F -0.57 6 -0.512222222 -0.3121429
16: F 1.70 6 -0.512222222 -0.3121429
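The function above reads df and df_ttls from the global environment. As an alternative sketch, not the answer's own code, the same calculation can be wrapped in a self-contained helper that takes plain vectors. This assumes a data.table version that provides `frollsum()`, and the name `rolling_group_mean` is hypothetical.

```r
library(data.table)

# Hypothetical helper: pooled mean of `value` over each group and the
# k - 1 groups before it, with groups ordered by first appearance.
rolling_group_mean <- function(group, value, k) {
  dt <- data.table(g = group, v = value)
  # Collapse to one row per group: total and number of rows.
  totals <- dt[, .(ttl = sum(v), N = .N), by = g]
  # Right-aligned rolling sums over k consecutive groups
  # (NA until k groups are available).
  totals[, avg := frollsum(ttl, n = as.integer(k)) /
                  frollsum(N, n = as.integer(k))]
  # Map the per-group result back onto every original row.
  totals[dt, on = "g", avg]
}

# Values copied from the set.seed(1038) printout above.
Group <- c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4))
Variable <- c(-0.86, 0.57, 0.10, -1.57, 1.73, -0.56, 0.54, -1.71,
              -0.47, -1.00, 1.03, -0.47, -1.06, -2.06, -0.57, 1.70)

res3 <- rolling_group_mean(Group, Variable, k = 3)
res5 <- rolling_group_mean(Group, Variable, k = 5)
```

The results should line up with the lag3avg and lag5avg columns printed above. Because the helper touches no global objects, it can be called for several values of k without side effects.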
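If you are willing to convert the data.table to a data.frame and run the process there, I can help. Take a look at this example and run the commands step by step to see how it works. The example covers the n = 3 case you mentioned.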
library(dplyr)
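# Group must be a factor so that as.numeric(Group) below returns its position;
# since R 4.0 data.frame() no longer converts strings to factors by default.
df <- data.frame(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)),
Variable = round(rnorm(16), 2), stringsAsFactors = TRUE)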
df %>% group_by(Group) %>%
do(data.frame(df2 = df)) %>%
mutate(diff = as.numeric(Group) - as.numeric(df2.Group)) %>%
filter(diff %in% 0:2) %>%
mutate(unique_pairs = n_distinct(diff)) %>%
filter(unique_pairs ==3) %>%
mutate(Mean = mean(df2.Variable)) %>%
filter(diff==0) %>%
select(Group, Mean) %>%
ungroup
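The idea is simply to create all combinations of Group values and then build a few helper columns to filter on. You could do the same with a for loop, but I would expect that to be slower.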
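The same "all combinations, then filter" idea can also be sketched in data.table syntax. This is only an illustration under stated assumptions: it uses `CJ()` to enumerate group pairs, reuses the values from the set.seed(1038) printout earlier on this page, and the names `pairs`, `target`, and `source` are made up for the example.

```r
library(data.table)

# Values copied from the set.seed(1038) printout earlier on this page.
df <- data.table(
  Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)),
  Variable = c(-0.86, 0.57, 0.10, -1.57, 1.73, -0.56, 0.54, -1.71,
               -0.47, -1.00, 1.03, -0.47, -1.06, -2.06, -0.57, 1.70)
)
n <- 3L

# Integer position of each group (A = 1, B = 2, ...).
df[, grp_no := as.integer(factor(Group))]
n_groups <- max(df$grp_no)

# All (target group, source group) pairs, keeping only pairs where the source
# is the target itself or one of the n - 1 groups before it.
pairs <- CJ(target = seq_len(n_groups), source = seq_len(n_groups))
pairs <- pairs[target - source >= 0L & target - source < n]

# Drop targets that do not have a full window of n groups.
pairs <- pairs[, if (.N == n) .SD, by = target]

# Attach every source row to its target, then average per target.
# The join is many-to-many, so allow.cartesian = TRUE is needed.
means <- df[pairs, on = .(grp_no = source), allow.cartesian = TRUE][
  , .(Mean = mean(Variable)), by = target]
means
```

The number of pairs grows with the square of the number of groups, so this is mainly useful for seeing how the filtering works rather than as a fast solution.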
If you really want to use data.table (this is still dplyr, but with a data.table structure in the background), try this:
library(dplyr)
library(data.table)
df <- data.table(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)),
Variable = round(rnorm(16), 2))
df = df %>% mutate(Group2 = as.numeric(as.factor(Group)))
df %>%
group_by(Group2, Group) %>%
do(data.table(df2 = df)) %>%
mutate(diff = Group2 - df2.Group2) %>%
filter(diff %in% 0:2) %>%
group_by(Group2, Group) %>%
mutate(unique_pairs = n_distinct(diff)) %>%
filter(unique_pairs ==3) %>%
group_by(Group2, Group) %>%
mutate(Mean = mean(df2.Variable)) %>%
filter(diff==0) %>%
select(Group2, Group, Mean) %>%
ungroup
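Here data.table did not cope well with factors, so I had to use numbers instead of the letters of the Group variable. I also had to group again after every mutate (a known dplyr issue when data.table is used in the background). The philosophy is exactly the same, though.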