Summarizing a data frame by group

Question

Consider the following data frame with 4 columns:

df = data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10), D = rnorm(10))

Columns A, B, C, and D belong to different groups, and the groups are defined in a separate data frame:

groups = data.frame(Class = c("A","B","C","D"), Group = c("G1", "G2", "G2", "G1"))

#> groups
#  Class Group
#1     A    G1
#2     B    G2
#3     C    G2
#4     D    G1

I want to average the elements of the columns that belong to the same group, and get something like this:

#> res
#            G1          G2
#1  -0.30023039 -0.71075139
#2   0.53053443 -0.12397126
#3   0.21968567 -0.46916160
#4  -1.13775100 -0.61266026
#5   1.30388130 -0.28021734
#6   0.29275876 -0.03994522
#7  -0.09649998  0.59396983
#8   0.71334020 -0.29818438
#9  -0.29830924 -0.47094084
#10 -0.36102888 -0.40181739

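where each cell of G1 is the mean of the corresponding cells of A and D, each cell of G2 is the mean of the corresponding cells of B and C, and so on.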

I was able to get this result, but in a rather brute-force way:

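l = levels(factor(groups$Group))  # factor() so this also works when Group is character
res = data.frame(matrix(ncol = length(l), nrow = nrow(df)))
for(i in seq_along(l)) {
    # positional selection: assumes the rows of groups are in the same order as the columns of df
    df.sub = df[which(groups$Group == l[i])]
    res[,i] = apply(df.sub, 1, mean)
}
names(res) <- l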

Is there a better way? In reality, I have more than 20 columns and more than 10 groups.

Thanks!

Answer 1

Using data.table:

library(data.table)
groups <- data.table(groups, key="Group")
DT <- data.table(df)

groups[, rowMeans(DT[, Class, with=FALSE]), by=Group][, setnames(as.data.table(matrix(V1, ncol=length(unique(Group)))), unique(Group))]

             G1         G2
 1: -0.13052091 -0.3667552
 2:  1.17178729 -0.5496347
 3:  0.23115841  0.8317714
 4:  0.45209516 -1.2180895
 5: -0.01861638 -0.4174929
 6: -0.43156831  0.9008427
 7: -0.64026238  0.1854066
 8:  0.56225108 -0.3563087
 9: -2.00405840 -0.4680040
10:  0.57608055 -0.6177605



# Also, make sure Class and Group are characters, not factors:
groups[, Class := as.character(Class)]
groups[, Group := as.character(Group)]

In plain base R:

 tapply(groups$Class, groups$Group, function(X) rowMeans(df[, X]))

Using sapply:

 sapply(unique(groups$Group), function(X) 
     rowMeans(df[, groups[groups$Group==X, "Class"]]) )
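If a data frame with one column per group is wanted (as in the question's `res`), the same idea can be written with `split` and `lapply`. This is only a minimal sketch, assuming the `df` and `groups` from the question; `cols_by_group` is a name introduced here for illustration:

```r
set.seed(1)
df <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10), D = rnorm(10))
groups <- data.frame(Class = c("A", "B", "C", "D"),
                     Group = c("G1", "G2", "G2", "G1"))

# Column names of df, split by group (as.character guards against factors)
cols_by_group <- split(as.character(groups$Class), as.character(groups$Group))

# One rowMeans() call per group; drop = FALSE keeps single-column groups as data frames
res <- as.data.frame(lapply(cols_by_group, function(cols) {
  rowMeans(df[, cols, drop = FALSE])
}))
```

Selecting columns by name rather than by position means the rows of `groups` do not have to be in the same order as the columns of `df`.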

Answer 2

Personally I would use Ricardo's solution, but another option is to first merge the two data sets and then use your preferred aggregation method.

library(reshape2)

## Retain the "rownames" so we can aggregate by row
temp <- merge(cbind(id = rownames(df), melt(df)), groups, 
              by.x = "variable", by.y = "Class")
head(temp)
#   variable id      value Group
# 1        A  1 -0.6264538    G1
# 2        A  2  0.1836433    G1
# 3        A  3 -0.8356286    G1
# 4        A  4  1.5952808    G1
# 5        A  5  0.3295078    G1
# 6        A  6 -0.8204684    G1

## This is the perfect form for `dcast` to do its work
dcast(temp, id ~ Group, value.var="value", mean)
#    id          G1          G2
# 1   1  0.36611287  1.21537927
# 2  10  0.22889368  0.50592144
# 3   2  0.04042780  0.58598977
# 4   3 -0.22397850 -0.27333780
# 5   4  0.77073788 -2.10202579
# 6   5 -0.52377589  0.87237833
# 7   6 -0.61773147 -0.05053117
# 8   7  0.04656955 -0.08599288
# 9   8  0.33950565 -0.26345809
# 10  9  0.83790336  0.17153557

(Using set.seed(1) for the sample "df" data.)
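The same merge-then-aggregate idea can also be sketched without reshape2. This is not the answer's own code, just a base R approximation under the assumption that `stack`, `merge`, and `tapply` are acceptable substitutes for `melt` and `dcast`; `long` is a name introduced here for illustration:

```r
set.seed(1)
df <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10), D = rnorm(10))
groups <- data.frame(Class = c("A", "B", "C", "D"),
                     Group = c("G1", "G2", "G2", "G1"))

# Long format: one row per (row id, column) pair.
# stack() stacks the columns one after another into `values`, with the source column in `ind`.
long <- data.frame(id = rep(seq_len(nrow(df)), times = ncol(df)),
                   stack(df))

# Attach the group of each original column
long <- merge(long, groups, by.x = "ind", by.y = "Class")

# Mean of the values for every id/group combination, as an id-by-group matrix
res <- tapply(long$values, list(long$id, long$Group), mean)
```

Because `id` is kept as an integer here, the rows come out in numeric order (1, 2, ..., 10) instead of the character order (1, 10, 2, ...) seen in the `dcast` output above.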

Answer 1 (3, accepted), 2013-10-23 16:19:53
Answer 2 (0), 2013-10-23 16:33:12
