
Colmeans in a dataframe by factor variable

I'm trying to get the mean of some variables in a data frame for each level of a factor. Say I have:

    time    geo var1    var2   var3    var4
1   1990    AT  1       7      13       19
2   1991    AT  2       8      14       20
3   1992    AT  3       9      15       21
4   1990    DE  4       10     16       22
5   1991    DE  5       11     17       23
6   1992    DE  6       12     18       24

And I want:

    time    geo var1    var2   var3    var4   m_var2   m_var3
1   1990    AT  1       7      13       19    8        14
2   1991    AT  2       8      14       20    8        14
3   1992    AT  3       9      15       21    8        14
4   1990    DE  4       10     16       22    11       17
5   1991    DE  5       11     17       23    11       17
6   1992    DE  6       12     18       24    11       17

I've tried a few things with by() and lapply(), but I think this points in the direction of ddply():

require(plyr)
Dataset <- data.frame(time=rep(c(1990:1992),2),geo=c(rep("AT",3),rep("DE",3))
      ,var1=as.numeric(c(1:6)),var2=as.numeric(c(7:12)),var3=as.numeric(c(13:18)),
      var4=as.numeric(c(19:24)))

newvars <- c("var2","var3")
newData <- Dataset[,c("geo",newvars)]

Currently, I can choose between two errors:

ddply(newData,newData[,"geo"],colMeans) 
#where R apparently thinks AT is the variable?

ddply(newData,"geo",colMeans)
#where R worries about the factor variable not being numeric?
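Both errors seem to come from how the arguments are interpreted: in the first call the values of geo ("AT", "DE") are read as variable names, and in the second colMeans() receives each group with the non-numeric geo column still in it. A minimal sketch of one way around this, assuming plyr is installed, is to let ddply() apply transform() to each group (res is just an illustrative name):

```r
library(plyr)

Dataset <- data.frame(
  time = rep(1990:1992, 2),
  geo  = rep(c("AT", "DE"), each = 3),
  var1 = as.numeric(1:6),
  var2 = as.numeric(7:12),
  var3 = as.numeric(13:18),
  var4 = as.numeric(19:24)
)

# transform() is called once per geo group, so mean() only sees that
# group's rows and its single value is recycled across them.
res <- ddply(Dataset, "geo", transform,
             m_var2 = mean(var2),
             m_var3 = mean(var3))
res
```

This keeps all original columns and appends the group means, at the cost of writing each new column out by hand.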

My lapply() attempts got me quite far, but then left me with a list I could not get back into the data frame:

lapply(newvars,function(x){
       by(Dataset[x],Dataset[,"geo"],function(x) 
       rep(colMeans(x,na.rm=T),length(unique(Dataset[,"time"]))))
       })

I think this should even be possible with merge and filters, as in "Lapply in a dataframe over different variables using filters", but I can't put it together. Any help would be appreciated!
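For the merge route mentioned above, a rough base R sketch could compute one row of means per geo with aggregate() and join it back onto the original rows. The names means and merged are illustrative only, and the data frame is rebuilt here so the block runs on its own:

```r
Dataset <- data.frame(
  time = rep(1990:1992, 2),
  geo  = rep(c("AT", "DE"), each = 3),
  var1 = as.numeric(1:6),
  var2 = as.numeric(7:12),
  var3 = as.numeric(13:18),
  var4 = as.numeric(19:24)
)
newvars <- c("var2", "var3")

# One row per geo, holding the mean of each selected column.
means <- aggregate(Dataset[newvars], by = list(geo = Dataset$geo), FUN = mean)
names(means)[-1] <- paste0("m_", newvars)

# Join the group means back onto the original rows.
merged <- merge(Dataset, means, by = "geo")
merged
```

Note that merge() moves the key column geo to the front and may reorder rows, so the layout differs slightly from the desired output shown in the question.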

Another method, using dplyr:

library(dplyr)
df1 %>% group_by(geo) %>% mutate(m_var2=mean(var2), m_var3=mean(var3))
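If the columns are held in a character vector rather than typed out, the same idea can be sketched with across(). This assumes dplyr 1.0.0 or later; vars and out are illustrative names, and df1 is rebuilt so the block is self-contained:

```r
library(dplyr)

df1 <- data.frame(
  time = rep(1990:1992, 2),
  geo  = rep(c("AT", "DE"), each = 3),
  var1 = 1:6, var2 = 7:12, var3 = 13:18, var4 = 19:24
)
vars <- c("var2", "var3")

out <- df1 %>%
  group_by(geo) %>%
  # .names builds the new column names, so the originals are kept.
  mutate(across(all_of(vars), mean, .names = "m_{.col}")) %>%
  ungroup()
out
```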

Another simple base R solution (here the data frame is called df) is just:

transform(df, m_var2 = ave(var2, geo), m_var3 = ave(var3, geo))
#   time geo var1 var2 var3 var4 m_var2 m_var3
# 1 1990  AT    1    7   13   19      8     14
# 2 1991  AT    2    8   14   20      8     14
# 3 1992  AT    3    9   15   21      8     14
# 4 1990  DE    4   10   16   22     11     17
# 5 1991  DE    5   11   17   23     11     17
# 6 1992  DE    6   12   18   24     11     17
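ave() returns a vector of the same length as its input, with each element replaced by its group's summary (the mean by default), which is why it slots directly into transform(). A small sketch of how it behaves with missing values, using made-up data rather than the question's:

```r
geo  <- c("AT", "AT", "AT", "DE", "DE", "DE")
var2 <- c(7, NA, 9, 10, 11, 12)  # hypothetical values with one NA

# The default FUN is mean, so the whole AT group becomes NA here.
plain <- ave(var2, geo)

# FUN has to be passed by name; unnamed extra arguments are treated
# as additional grouping variables.
na_safe <- ave(var2, geo, FUN = function(v) mean(v, na.rm = TRUE))
```

Here na_safe holds 8 for the AT rows and 11 for the DE rows, while plain is NA for AT.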

A couple of years later, I think a more concise approach is to update the existing data set in place (instead of creating a new one) and operate on a vector of column names (instead of writing them out manually):

vars <- paste0("var", 2:3) # Select desired cols
df[paste0("m_", vars)] <- lapply(df[vars], ave, df[["geo"]]) # Loop and update

One option would be to use data.table. We can convert the data.frame to a data.table (setDT(df1)) and get the mean (lapply(.SD, mean)) of the selected columns ('var2' and 'var3') by specifying their column indices in .SDcols, grouped by 'geo'. Assigning the output (:=) to the new column names (paste('m', names(df1)[4:5], sep="_")) creates the new columns.

library(data.table)
setDT(df1)[, paste('m', names(df1)[4:5], sep="_") :=lapply(.SD, mean)
            ,by = geo, .SDcols=4:5]
#     time geo var1 var2 var3 var4 m_var2 m_var3
#1: 1990  AT    1    7   13   19      8     14
#2: 1991  AT    2    8   14   20      8     14
#3: 1992  AT    3    9   15   21      8     14
#4: 1990  DE    4   10   16   22     11     17
#5: 1991  DE    5   11   17   23     11     17
#6: 1992  DE    6   12   18   24     11     17

NOTE: This method is more general. We can create the mean columns even for hundreds of variables without any major change in the code, i.e. if we need the mean of columns 4:100, change to .SDcols=4:100 and paste('m', names(df1)[4:100], sep="_").
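Column positions can shift when a data set changes, so a variant that selects by name may be more robust. The following is only a sketch along the same lines, assuming data.table is installed; dt, cols and new_cols are illustrative names:

```r
library(data.table)

dt <- data.table(
  time = rep(1990:1992, 2),
  geo  = rep(c("AT", "DE"), each = 3),
  var1 = 1:6, var2 = 7:12, var3 = 13:18, var4 = 19:24
)
cols     <- c("var2", "var3")
new_cols <- paste0("m_", cols)

# Parentheses make := treat new_cols as a vector of names rather than
# as a single column literally called "new_cols".
dt[, (new_cols) := lapply(.SD, mean), by = geo, .SDcols = cols]
dt
```

Extending this to more variables then only requires changing cols.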

Data:

df1 <- structure(list(time = c(1990L, 1991L, 1992L, 1990L, 1991L, 1992L
), geo = c("AT", "AT", "AT", "DE", "DE", "DE"), var1 = 1:6, var2 = 7:12, 
var3 = 13:18, var4 = 19:24), .Names = c("time", "geo", "var1", 
"var2", "var3", "var4"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

base R:

 cbind(df1,m_var2=ave(df1$var2,df1$geo),m_var3=ave(df1$var3,df1$geo))
