Column means by factor variable in a data frame

Question

I'm trying to get the mean of some variables in a data frame, computed separately for each level of a factor. Say I have:

    time    geo var1    var2   var3    var4
1   1990    AT  1       7      13       19
2   1991    AT  2       8      14       20
3   1992    AT  3       9      15       21
4   1990    DE  4       10     16       22
5   1991    DE  5       11     17       23
6   1992    DE  6       12     18       24

And I want:

    time    geo var1    var2   var3    var4   m_var2   m_var3
1   1990    AT  1       7      13       19    8        14
2   1991    AT  2       8      14       20    8        14
3   1992    AT  3       9      15       21    8        14
4   1990    DE  4       10     16       22    11       17
5   1991    DE  5       11     17       23    11       17
6   1992    DE  6       12     18       24    11       17

I've tried a few things with by() and lapply(), but I think this goes in the direction of ddply:

library(plyr)
Dataset <- data.frame(time = rep(1990:1992, 2), geo = rep(c("AT", "DE"), each = 3),
                      var1 = as.numeric(1:6), var2 = as.numeric(7:12),
                      var3 = as.numeric(13:18), var4 = as.numeric(19:24))

newvars <- c("var2","var3")
newData <- Dataset[,c("geo",newvars)]

Currently, I can choose between two errors:

ddply(newData,newData[,"geo"],colMeans) 
#where R apparently thinks AT is the variable?

ddply(newData,"geo",colMeans)
#where R worries about the factor variable not being numeric?

My lapply attempts got me quite far, but then left me with a list I could not get back into the data frame:

lapply(newvars,function(x){
       by(Dataset[x],Dataset[,"geo"],function(x) 
       rep(colMeans(x,na.rm=T),length(unique(Dataset[,"time"]))))
       })

I think this should even be possible with merge and filters, as in "Lapply in a dataframe over different variables using filters", but I can't get it together. Any help would be appreciated!

Answer 1

Another method, using dplyr:

library(dplyr)
df1 %>% group_by(geo) %>% mutate(m_var2=mean(var2), m_var3=mean(var3))
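
If the columns to average are held in a character vector, the same grouped mutate can be written without naming each column by hand. The following is a minimal sketch, assuming dplyr 1.0.0 or later (for across()); the vector name vars and the rebuilt example data are illustrative and not part of the original answer.

```r
library(dplyr)

# Example data rebuilt from the question
df1 <- data.frame(
  time = rep(1990:1992, 2),
  geo  = rep(c("AT", "DE"), each = 3),
  var1 = 1:6, var2 = 7:12, var3 = 13:18, var4 = 19:24
)

vars <- c("var2", "var3")  # hypothetical name: the columns to average

res <- df1 %>%
  group_by(geo) %>%
  # one m_<column> per selected column, holding that column's mean within geo
  mutate(across(all_of(vars), mean, .names = "m_{.col}")) %>%
  ungroup()
```

The ungroup() at the end is optional; it only keeps later operations on res from being applied per group.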

Answer 2

Another simple base R solution (with the data frame stored as df) is just:

transform(df, m_var2 = ave(var2, geo), m_var3 = ave(var3, geo))
#   time geo var1 var2 var3 var4 m_var2 m_var3
# 1 1990  AT    1    7   13   19      8     14
# 2 1991  AT    2    8   14   20      8     14
# 3 1992  AT    3    9   15   21      8     14
# 4 1990  DE    4   10   16   22     11     17
# 5 1991  DE    5   11   17   23     11     17
# 6 1992  DE    6   12   18   24     11     17

A couple of years later, I think a more concise approach would be to both update the actual data set (instead of creating a new one) and operate on a vector of columns (instead of writing them out manually):

vars <- paste0("var", 2:3) # Select desired cols
df[paste0("m_", vars)] <- lapply(df[vars], ave, df[["geo"]]) # Loop and update
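
For readers unfamiliar with ave(): it applies a function (mean by default) within each group and returns a vector as long as its input, which is why the result can be assigned straight back as a column. Below is a small base R sketch of that behaviour, including a variant that ignores missing values. The NA value and the helper name mean_narm are made up for illustration.

```r
df <- data.frame(
  geo  = rep(c("AT", "DE"), each = 3),
  var2 = c(7, 8, 9, 10, NA, 12),  # NA inserted for illustration only
  var3 = 13:18
)

# Default FUN is mean: one value per row, repeated within each geo group
ave(df$var3, df$geo)

# FUN must be passed by name, because unnamed arguments after x are
# treated as grouping factors
mean_narm <- function(v) mean(v, na.rm = TRUE)  # hypothetical helper

vars <- c("var2", "var3")
df[paste0("m_", vars)] <- lapply(
  df[vars],
  function(x) ave(x, df[["geo"]], FUN = mean_narm)
)
```

The anonymous function is needed here because writing FUN = ... directly inside lapply() would be matched to lapply's own FUN argument instead of being forwarded to ave().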

Answer 3

One option would be to use data.table. We can convert the data.frame to a data.table ( setDT(df1) ) and get the mean ( lapply(.SD, mean) ) of the selected columns ('var2' and 'var3') by specifying their column indices in .SDcols, grouped by 'geo'. The new columns are created by assigning the output ( := ) to the new column names ( paste('m', names(df1)[4:5], sep="_") ).

library(data.table)
setDT(df1)[, paste('m', names(df1)[4:5], sep="_") :=lapply(.SD, mean)
            ,by = geo, .SDcols=4:5]
#     time geo var1 var2 var3 var4 m_var2 m_var3
#1: 1990  AT    1    7   13   19      8     14
#2: 1991  AT    2    8   14   20      8     14
#3: 1992  AT    3    9   15   21      8     14
#4: 1990  DE    4   10   16   22     11     17
#5: 1991  DE    5   11   17   23     11     17
#6: 1992  DE    6   12   18   24     11     17

NOTE: This method is more general. We can create the mean columns even for hundreds of variables without any major change in the code, i.e. if we need the mean of columns 4:100, change to .SDcols=4:100 and use paste('m', names(df1)[4:100], sep="_").
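
Positional indices such as 4:5 shift if columns are added or reordered. A hedged variant of the same idea selects the columns by name instead. The names dt, vars and new_cols are illustrative, and the data is rebuilt from the question.

```r
library(data.table)

dt <- data.table(
  time = rep(1990:1992, 2),
  geo  = rep(c("AT", "DE"), each = 3),
  var1 = 1:6, var2 = 7:12, var3 = 13:18, var4 = 19:24
)

vars     <- c("var2", "var3")   # columns to average, selected by name
new_cols <- paste0("m_", vars)  # names for the new mean columns

# Add the new columns by reference, computing each mean within geo
dt[, (new_cols) := lapply(.SD, mean), by = geo, .SDcols = vars]
```

Wrapping new_cols in parentheses tells data.table to use the character vector's contents as column names rather than creating a column literally called new_cols.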

Data

df1 <- structure(list(time = c(1990L, 1991L, 1992L, 1990L, 1991L, 1992L
), geo = c("AT", "AT", "AT", "DE", "DE", "DE"), var1 = 1:6, var2 = 7:12, 
var3 = 13:18, var4 = 19:24), .Names = c("time", "geo", "var1", 
"var2", "var3", "var4"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

Answer 4

In base R:

 cbind(df1,m_var2=ave(df1$var2,df1$geo),m_var3=ave(df1$var3,df1$geo))

4 answers

Answer 1
7 2015-05-11 13:51:57

Answer 2
7 2015-05-11 13:54:53

Answer 3
5 accepted 2015-05-11 13:41:30

Answer 4
4 2015-05-11 13:53:30
