Colmeans in a dataframe by factor variable
I'm trying to get the mean of some variables inside a dataframe for different factors. Say I have:
time geo var1 var2 var3 var4
1 1990 AT 1 7 13 19
2 1991 AT 2 8 14 20
3 1992 AT 3 9 15 21
4 1990 DE 4 10 16 22
5 1991 DE 5 11 17 23
6 1992 DE 6 12 18 24
And I want:
time geo var1 var2 var3 var4 m_var2 m_var3
1 1990 AT 1 7 13 19 8 14
2 1991 AT 2 8 14 20 8 14
3 1992 AT 3 9 15 21 8 14
4 1990 DE 4 10 16 22 11 17
5 1991 DE 5 11 17 23 11 17
6 1992 DE 6 12 18 24 11 17
I've tried a few things with by() and lapply(), but I think this goes in the direction of ddply:
require(plyr)
Dataset <- data.frame(time = rep(1990:1992, 2), geo = c(rep("AT", 3), rep("DE", 3)),
                      var1 = as.numeric(1:6), var2 = as.numeric(7:12),
                      var3 = as.numeric(13:18), var4 = as.numeric(19:24))
newvars <- c("var2","var3")
newData <- Dataset[,c("geo",newvars)]
Currently, I can choose between two errors:
ddply(newData,newData[,"geo"],colMeans)
#where R apparently thinks AT is the variable?
ddply(newData,"geo",colMeans)
#where R worries about the factor variable not being numeric?
My lapply attempts got me quite far, but then left me with a list I could not get back into the dataframe:
lapply(newvars,function(x){
by(Dataset[x],Dataset[,"geo"],function(x)
rep(colMeans(x,na.rm=T),length(unique(Dataset[,"time"]))))
})
I think this must even be possible with merge and filters, as here: Lapply in a dataframe over different variables using filters, but I can't get it together. Any help would be appreciated!
Another method, with dplyr:
library(dplyr)
df1 %>% group_by(geo) %>% mutate(m_var2=mean(var2), m_var3=mean(var3))
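If the list of columns grows, the mutate() call above can be driven by a character vector instead of writing each mean by hand. This is only a sketch: it assumes dplyr 1.0.0 or later (for across()), and the names vars and res are made up for illustration.

```r
library(dplyr)

# Example data shaped like the question's
df1 <- data.frame(
  time = rep(1990:1992, 2),
  geo  = rep(c("AT", "DE"), each = 3),
  var1 = 1:6, var2 = 7:12, var3 = 13:18, var4 = 19:24
)

vars <- c("var2", "var3")  # columns to average (hypothetical helper name)

# across() applies mean() to each selected column within each geo group;
# .names controls how the new columns are called ("m_var2", "m_var3")
res <- df1 %>%
  group_by(geo) %>%
  mutate(across(all_of(vars), mean, .names = "m_{.col}")) %>%
  ungroup()
```

The ungroup() at the end is optional; it just returns a plain tibble so later steps are not silently done per group.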
Another simple base R solution is just:
transform(df, m_var2 = ave(var2, geo), m_var3 = ave(var3, geo))
# time geo var1 var2 var3 var4 m_var2 m_var3
# 1 1990 AT 1 7 13 19 8 14
# 2 1991 AT 2 8 14 20 8 14
# 3 1992 AT 3 9 15 21 8 14
# 4 1990 DE 4 10 16 22 11 17
# 5 1991 DE 5 11 17 23 11 17
# 6 1992 DE 6 12 18 24 11 17
A couple of years later, I think a more concise approach would be to both update the actual data set (instead of creating a new one) and operate on a vector of columns (instead of manually writing them):
vars <- paste0("var", 2:3) # Select desired cols
df[paste0("m_", vars)] <- lapply(df[vars], ave, df[["geo"]]) # Loop and update
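The question used na.rm, so here is a rough sketch of how the same loop might look when the mean needs extra arguments. ave() takes its summary function through the named FUN argument, and passing FUN = ... straight through lapply() would collide with lapply's own FUN, so the sketch wraps ave() in an anonymous function. The missing value is added only for illustration.

```r
# Example data shaped like the question's, with one missing value added
df <- data.frame(
  time = rep(1990:1992, 2),
  geo  = rep(c("AT", "DE"), each = 3),
  var1 = 1:6, var2 = 7:12, var3 = 13:18, var4 = 19:24
)
df$var2[2] <- NA

vars <- paste0("var", 2:3)  # Select desired cols

# For each column, ave() returns a vector as long as the column,
# holding that row's group mean (NAs dropped before averaging)
df[paste0("m_", vars)] <- lapply(df[vars], function(x) {
  ave(x, df[["geo"]], FUN = function(v) mean(v, na.rm = TRUE))
})
```

Without na.rm = TRUE, every AT row of m_var2 would be NA, because the default mean() propagates missing values.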
One option would be to use data.table. We can convert the data.frame to a data.table (setDT(df1)) and get the mean (lapply(.SD, mean)) of the selected columns ('var2' and 'var3') by specifying their column indices in .SDcols, grouped by 'geo'. We then create new columns by assigning the output (:=) to the new column names (paste('m', names(df1)[4:5], sep="_")).
library(data.table)
setDT(df1)[, paste('m', names(df1)[4:5], sep="_") :=lapply(.SD, mean)
,by = geo, .SDcols=4:5]
# time geo var1 var2 var3 var4 m_var2 m_var3
#1: 1990 AT 1 7 13 19 8 14
#2: 1991 AT 2 8 14 20 8 14
#3: 1992 AT 3 9 15 21 8 14
#4: 1990 DE 4 10 16 22 11 17
#5: 1991 DE 5 11 17 23 11 17
#6: 1992 DE 6 12 18 24 11 17
NOTE: This method is more general. We can create the mean columns even for hundreds of variables without any major change in the code, i.e. if we need the mean of columns 4:100, change .SDcols to 4:100 and the names in the paste call to names(df1)[4:100].
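Column positions can shift when the data changes, so the same idea could also be written with column names. A minimal sketch, assuming the data.table package is installed; dt, vars and new_cols are names chosen for illustration, not part of the answer above.

```r
library(data.table)

# Example data shaped like the question's
dt <- data.table(
  time = rep(1990:1992, 2),
  geo  = rep(c("AT", "DE"), each = 3),
  var1 = 1:6, var2 = 7:12, var3 = 13:18, var4 = 19:24
)

vars     <- c("var2", "var3")    # columns to average, chosen by name
new_cols <- paste0("m_", vars)   # names for the new columns

# (new_cols) := ... adds the columns by reference; .SDcols accepts names
# as well as positions, and by = geo computes the means per group
dt[, (new_cols) := lapply(.SD, mean), by = geo, .SDcols = vars]
```

Because := modifies dt in place, nothing is printed; type dt (or append [] to the call) to see the result.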
df1 <- structure(list(time = c(1990L, 1991L, 1992L, 1990L, 1991L, 1992L
), geo = c("AT", "AT", "AT", "DE", "DE", "DE"), var1 = 1:6, var2 = 7:12,
var3 = 13:18, var4 = 19:24), .Names = c("time", "geo", "var1",
"var2", "var3", "var4"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
In base R:
cbind(df1,m_var2=ave(df1$var2,df1$geo),m_var3=ave(df1$var3,df1$geo))
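Since the question mentions merge, here is a rough sketch of that route too: aggregate() builds one row of means per geo, and merge() joins them back onto every row. The names vars, means and res are made up for illustration.

```r
# Example data shaped like the question's
df1 <- data.frame(
  time = rep(1990:1992, 2),
  geo  = rep(c("AT", "DE"), each = 3),
  var1 = 1:6, var2 = 7:12, var3 = 13:18, var4 = 19:24
)

vars <- c("var2", "var3")  # columns to average

# One row per geo, holding the mean of each selected column
means <- aggregate(df1[vars], by = list(geo = df1$geo), FUN = mean)
names(means)[-1] <- paste0("m_", vars)

# Join the group means back onto the original rows
res <- merge(df1, means, by = "geo")
```

Unlike the ave() version, merge() moves geo to the first column and may reorder the rows, so sort afterwards if the original order matters.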