
Aggregate a matrix (or data.frame) by column name groups in R

I have a large matrix with about 3000 columns x 3000 rows. For every row, I'd like to aggregate (calculate the mean) grouped by column name, so that columns sharing a name are averaged together. The columns are named like this (and appear in random order):

 Tree Tree House House Tree Car Car House

I need the result (the per-row mean within each column-name group) to have the following columns:

  Tree House Car
  • The tricky part (at least for me) is that I do not know all the column names, and they are all in random order!

You could try

res1 <- vapply(unique(colnames(m1)), function(x) 
      rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE),
                             numeric(nrow(m1)) )

Or

res2 <-  sapply(unique(colnames(m1)), function(x) 
       rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE) )

identical(res1,res2)
#[1] TRUE
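
A further base R variant, offered only as a sketch and not part of the answer above, is to let rowsum() compute the per-group sums on the transposed matrix and then divide by the number of columns in each group. The small matrix `m` and the names `grp`, `sums`, `n` and `res` are made up for illustration, and the sketch assumes the data contain no NA values.

```r
# Small illustrative matrix with repeated column names (hypothetical data).
set.seed(1)
m <- matrix(sample(0:40, 4 * 6, replace = TRUE), nrow = 4,
            dimnames = list(NULL, c("Tree", "House", "Tree", "Car", "Car", "House")))

grp <- colnames(m)

# rowsum() sums rows by group, so transpose first in order to group the columns.
# reorder = FALSE keeps the groups in order of first appearance.
sums <- rowsum(t(m), group = grp, reorder = FALSE)

# Number of columns in each group, in the same order as the rows of `sums`.
n <- tabulate(match(grp, rownames(sums)))

# Divide each group sum by its column count, then transpose back
# so that rows are the original rows and columns are the groups.
res <- t(sums / n)
```

The result has one column per distinct name, in the same order as unique(colnames(m)), so it can be compared directly with the vapply/sapply output.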

Another option might be to reshape into long form and then do the aggregation

 library(data.table)
 # melt() for a matrix comes from reshape2, so call it explicitly
 res3 <- dcast.data.table(setDT(reshape2::melt(m1)), Var1~Var2,
                          fun.aggregate=mean)[,Var1:= NULL]
 identical(res1, as.matrix(res3))
 #[1] TRUE
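
To make the long-form idea concrete without extra packages, here is a minimal base R sketch. It is an illustration under assumptions, not the data.table code above: the matrix `m` and the names `long` and `res_long` are hypothetical. Each cell becomes one record of (row, group, value), and tapply() then averages within each row and group.

```r
# Small illustrative matrix with repeated column names (hypothetical data).
set.seed(1)
m <- matrix(sample(0:40, 4 * 6, replace = TRUE), nrow = 4,
            dimnames = list(NULL, c("Tree", "House", "Tree", "Car", "Car", "House")))

# Long form: one record per matrix cell.
long <- data.frame(
  row   = as.vector(row(m)),                # row index of each cell
  group = colnames(m)[as.vector(col(m))],   # column name of each cell
  value = as.vector(m)
)

# Mean of the values for every (row, group) combination.
res_long <- tapply(long$value, list(long$row, long$group), mean)
```

Note that tapply() orders the groups alphabetically (here Car, House, Tree), whereas the vapply/sapply versions keep the order of first appearance.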

Benchmarks

For a 3000*3000 matrix, the first two methods perform about the same, and both are roughly 4.5 to 5 times faster than the reshape approach

set.seed(24)
m1 <- matrix(sample(0:40, 3000*3000, replace=TRUE), 
   ncol=3000, dimnames=list(NULL, sample(c('Tree', 'House', 'Car'),
    3000,replace=TRUE)))

library(microbenchmark)

f1 <-function() {vapply(unique(colnames(m1)), function(x) 
     rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE),
                           numeric(nrow(m1)) )}
f2 <- function() {sapply(unique(colnames(m1)), function(x) 
       rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE) )}

# melt() for a matrix comes from reshape2, so call it explicitly
f3 <- function() {dcast.data.table(setDT(reshape2::melt(m1)), Var1~Var2,
            fun.aggregate=mean)[, Var1:= NULL]}

microbenchmark(f1(), f2(), f3(), unit="relative", times=10L)
#   Unit: relative
# expr      min       lq     mean   median       uq      max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10
# f2() 1.026208 1.027723 1.037593 1.034516 1.028847 1.079004    10
# f3() 4.529037 4.567816 4.834498 4.855776 4.930984 5.529531    10

Data

 set.seed(24)
 m1 <- matrix(sample(0:40, 10*40, replace=TRUE), ncol=10, 
     dimnames=list(NULL, sample(c("Tree", "House", "Car"), 10, replace=TRUE)))

I came up with my own solution. I first transpose the matrix (called test_mean) so that the columns become rows, then:

# removing numbers from rownames
rownames(test_mean)<-gsub("[0-9.]","",rownames(test_mean))


#aggregate by rownames
test_mean<-aggregate(test_mean, by=list(rownames(test_mean)), FUN=mean)
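
For readers who want to run this end to end, the following is a sketch of the same idea on a small made-up matrix. It assumes, since the post does not say so, that the digits and dots being stripped come from the names having been made unique (Tree, Tree.1, ...), as happens when a matrix is passed through data.frame(). The names `m`, `tm`, `grp`, `agg` and `res_agg` are illustrative.

```r
# Small illustrative matrix with repeated column names (hypothetical data).
set.seed(1)
m <- matrix(sample(0:40, 4 * 6, replace = TRUE), nrow = 4,
            dimnames = list(NULL, c("Tree", "House", "Tree", "Car", "Car", "House")))

# data.frame() makes the names unique (Tree, House, Tree.1, ...);
# transposing turns the columns into rows.
tm <- t(data.frame(m))

# Strip digits and dots to recover the group name for each row.
grp <- gsub("[0-9.]", "", rownames(tm))

# Average all rows belonging to the same group.
agg <- aggregate(tm, by = list(group = grp), FUN = mean)

# Transpose back: rows are the original rows, columns are the groups.
res_agg <- t(as.matrix(agg[, -1]))
colnames(res_agg) <- agg$group
```

One caveat of this approach: the pattern "[0-9.]" also removes digits and dots that are a genuine part of a column name, so it only suits names like the ones in the question.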

matrixStats::rowMeans2 with some coercive help from data.table, for the win!

Adding it to the benchmarks from @akrun, we get:

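library(matrixStats)  # rowMeans2()
library(data.table)   # setDF(), setnames()

f4 <- function() {
  ucn <- unique(colnames(m1))
  # one rowMeans2() call per distinct column name, then bind into a matrix
  as.matrix(setnames(setDF(lapply(ucn, function(n) rowMeans2(m1, cols=colnames(m1)==n))),
                     ucn))
}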

> all.equal(f4(),f1())
[1] TRUE

> microbenchmark(f1(), f2(), f3(), f4(), unit="relative", times=10L)
Unit: relative
 expr       min        lq      mean    median        uq       max neval cld
 f1()  1.837496  1.841282  1.823375  1.834471  1.818822  1.749826    10  b 
 f2()  1.760133  1.825352  1.817355  1.826257  1.838439  1.793824    10  b 
 f3() 15.451106 15.606912 15.847117 15.586192 16.626629 16.104648    10   c
 f4()  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    10 a  
