Apply a function over groups of columns

How can I use apply or a related function to create a new data frame that contains the row averages of each group of columns in a very large data frame?

I have an instrument that outputs n replicate measurements on a large number of samples, where each single measurement is a vector (all measurements are vectors of the same length). I'd like to calculate the average (and other stats) on all replicate measurements of each sample. This means I need to group n consecutive columns together and do row-wise calculations.

For a simple example, with three replicate measurements on two samples, how can I end up with a data frame that has two columns (one per sample), one that is the average of each row of the replicates in dat$a, dat$b and dat$c, and one that is the average of each row for dat$d, dat$e and dat$f?

Here's some example data:

dat <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))

            a          b            c          d           e          f
1  -0.9089594 -0.8144765  0.872691548  0.4051094 -0.09705234 -1.5100709
2   0.7993102  0.3243804  0.394560355  0.6646588  0.91033497  2.2504104
3   0.2963102 -0.2911078 -0.243723116  1.0661698 -0.89747522 -0.8455833
4  -0.4311512 -0.5997466 -0.545381175  0.3495578  0.38359390  0.4999425
5  -0.4955802  1.8949285 -0.266580411  1.2773987 -0.79373386 -1.8664651
6   1.0957793 -0.3326867 -1.116623982 -0.8584253  0.83704172  1.8368212
7  -0.2529444  0.5792413 -0.001950741  0.2661068  1.17515099  0.4875377
8   1.2560402  0.1354533  1.440160168 -2.1295397  2.05025701  1.0377283
9   0.8123061  0.4453768  1.598246016  0.7146553 -1.09476532  0.0600665
10  0.1084029 -0.4934862 -0.584671816 -0.8096653  1.54466019 -1.8117459
11 -0.8152812  0.9494620  0.100909570  1.5944528  1.56724269  0.6839954
12  0.3130357  2.6245864  1.750448404 -0.7494403  1.06055267  1.0358267
13  1.1976817 -1.2110708  0.719397607 -0.2690107  0.83364274 -0.6895936
14 -2.1860098 -0.8488031 -0.302743475 -0.7348443  0.34302096 -0.8024803
15  0.2361756  0.6773727  1.279737692  0.8742478 -0.03064782 -0.4874172
16 -1.5634527 -0.8276335  0.753090683  2.0394865  0.79006103  0.5704210

I'm after something like this:

            X1          X2
1  -0.28358147 -0.40067128
2   0.50608365  1.27513471
3  -0.07950691 -0.22562957
4  -0.52542633  0.41103139
5   0.37758930 -0.46093340
6  -0.11784382  0.60514586
7   0.10811540  0.64293184
8   0.94388455  0.31948189
9   0.95197629 -0.10668118
10 -0.32325169 -0.35891702
11  0.07836345  1.28189698
12  1.56269017  0.44897971
13  0.23533617 -0.04165384
14 -1.11251880 -0.39810121
15  0.73109533  0.11872758
16 -0.54599850  1.13332286

which I did with this, but it is obviously no good for my much larger data frame...

data.frame(cbind(
apply(cbind(dat$a, dat$b, dat$c), 1, mean),
apply(cbind(dat$d, dat$e, dat$f), 1, mean)
))

I've tried apply and loops and can't quite get it together. My actual data has some hundreds of columns.

This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame), I'd opt for lapply with do.call rather than sapply:

x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))

This also works if you only have the column names:

x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))

EDIT

It just occurred to me that you may want to automate this for every three columns. I know there's a better way, but here it is on a 100-column data set:

dat <- data.frame(matrix(rnorm(16*100), ncol=100))

n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))

EDIT 2: Still not happy with the indexing. I think there's a better/faster way to pass the indexes. Here's a second, though not satisfying, method:

n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=FALSE, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]

do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))
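
As a possible simplification of the indexing, the index list can be built with split() instead of an NA-padded matrix. This is a sketch rather than the answerer's method; it assumes, like the code above, that trailing columns which do not fill a complete group are dropped. The names keep, grp and idx are made up for illustration.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
dat <- data.frame(matrix(rnorm(16 * 100), ncol = 100))
n <- 3       # columns per group

# Column positions that fall into complete groups (1..99 for 100 columns)
keep <- seq_len(n * (ncol(dat) %/% n))
grp  <- ceiling(keep / n)      # 1,1,1,2,2,2,...
idx  <- split(keep, grp)       # list of column-index vectors, one per group

res <- do.call(cbind, lapply(idx, function(i) rowMeans(dat[, i])))
dim(res)
```

Changing n is then the only edit needed to switch to groups of a different size.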

Mean of each row across vectors a, b, c:

 rowMeans(dat[1:3])

Mean of each row across vectors d, e, f:

 rowMeans(dat[4:6])

All in one call you get:

results<-cbind(rowMeans(dat[1:3]),rowMeans(dat[4:6]))

If you only know the names of the columns and not their order, then you can use:

rowMeans(cbind(dat["a"],dat["b"],dat["c"]))
rowMeans(cbind(dat["d"],dat["e"],dat["f"]))

# I don't know how much damage this does to speed, but it should still be quick

A similar question was asked by @david: "averaging every 16 columns in r" (now closed), which I answered by adapting @TylerRinker's answer above, following a suggestion by @joran and @Ben. Because the resulting function might be of help to OP or future readers, I am copying that function here, along with an example for OP's data.

# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
    # Create index list
    if (length(by) == 1)
    {
        nc <- ncol(x)
        split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
    } else # 'by' is a vector of groups
    {
        nc <- length(by)
        split.index <- by
    }
    index.list <- split(seq(from = 1, to = nc), split.index)

    # Pass index list to fun using sapply() and return object
    sapply(index.list, function(i)
            {
                # drop = FALSE keeps a single-column group two-dimensional
                do.call(fun, list(x[, i, drop = FALSE], ...))
            })
}

Then, to find the mean of the replicates:

byapply(dat, 3, rowMeans)

Or, perhaps, the standard deviation of the replicates:

byapply(dat, 3, apply, 1, sd)

Update

by can also be specified as a vector of groups:

byapply(dat, c(1,1,1,2,2,2), rowMeans)
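
The group vector does not have to be typed by hand. The sketch below derives it from column names and forwards an extra argument through `...`. The column naming scheme (sample_replicate, e.g. s1_r2) and the injected NA are hypothetical, chosen only for illustration; byapply() is restated compactly, with drop = FALSE, so that the block runs on its own.

```r
# Compact restatement of byapply() so this sketch is self-contained
byapply <- function(x, by, fun, ...) {
  if (length(by) == 1) {
    nc <- ncol(x)
    split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
  } else {
    nc <- length(by)
    split.index <- by
  }
  index.list <- split(seq_len(nc), split.index)
  # drop = FALSE keeps a one-column group two-dimensional
  sapply(index.list, function(i) do.call(fun, list(x[, i, drop = FALSE], ...)))
}

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
dat <- data.frame(matrix(rnorm(16 * 6), ncol = 6))
# Hypothetical naming scheme: <sample>_<replicate>
names(dat) <- c("s1_r1", "s1_r2", "s1_r3", "s2_r1", "s2_r2", "s2_r3")
dat[1, 1] <- NA   # a missing reading, for illustration only

# Strip the replicate suffix to get one group label per column
by <- sub("_r[0-9]+$", "", names(dat))

# na.rm = TRUE is passed on to rowMeans() through '...'
res <- byapply(dat, by, rowMeans, na.rm = TRUE)
head(res)
```

The result is a matrix whose column names are the group labels, so the sample identity survives the aggregation.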

The rowMeans solutions will be faster, but for completeness, here is how you might do this with apply:

t(apply(dat,1,function(x){ c(mean(x[1:3]),mean(x[4:6])) }))
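
To confirm that the apply version and the rowMeans version agree, the two results can be compared directly. This is a small sketch on data shaped like the question's example; the variable names and the fixed seed are assumptions made for illustration.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
dat <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16),
                  d = rnorm(16), e = rnorm(16), f = rnorm(16))

# Row-wise apply: one pass over the rows, two means per row
via_apply <- t(apply(dat, 1, function(x) c(mean(x[1:3]), mean(x[4:6]))))

# rowMeans on each block of columns
via_rowMeans <- cbind(rowMeans(dat[1:3]), rowMeans(dat[4:6]))

# Compare the numbers only, ignoring dimnames
all.equal(via_apply, via_rowMeans, check.attributes = FALSE)
```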

Inspired by @joran's suggestion, I came up with this (actually a bit different from what he suggested, though the transposing suggestion was especially useful):

Make a data frame of example data with p columns to simulate a realistic data set (following @TylerRinker's answer above and unlike my poor example in the question):

p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))

Rename the columns in this data frame to create groups of n consecutive columns, so that if I'm interested in groups of three columns I get column names like 1,1,1,2,2,2,3,3,3, etc., or if I wanted groups of four columns it would be 1,1,1,1,2,2,2,2,3,3,3,3, etc. I'm going with three for now (I guess this is a kind of indexing for people like me who don't know much about indexing):

n <- 3 # how many consecutive columns in the groups of interest?
names(dat) <- rep(seq_len(ceiling(ncol(dat) / n)), each = n, length.out = ncol(dat))

Now use apply and tapply to get row means for each of the groups:

dat.avs <- data.frame(t(apply(dat, 1, tapply, names(dat), mean)))

The main downsides are that the column names in the original data are replaced (though this could be overcome by putting the grouping numbers in a new row rather than in the colnames) and that the columns are returned by the apply-tapply call in an unhelpful order.
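
Both downsides can probably be avoided by keeping the grouping in a separate factor instead of in the column names. In this sketch (an assumption about how one might do it, not part of the original answer), grp is a made-up name; because it is built from integers, its levels are in numeric order, and tapply returns groups in level order.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
p <- 99      # how many columns?
n <- 3       # how many consecutive columns per group?
dat <- data.frame(matrix(rnorm(4 * p), ncol = p))

# Grouping factor kept outside the data frame; levels are 1, 2, ..., 33
grp <- factor(rep(seq_len(ceiling(p / n)), each = n, length.out = p))

# Same apply/tapply idea, but the original column names stay untouched
dat.avs <- data.frame(t(apply(as.matrix(dat), 1, tapply, grp, mean)))
dim(dat.avs)
```

Column k of dat.avs then corresponds to the k-th group of n consecutive columns.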

Further to @joran's suggestion, here's a data.table solution:

p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
dat.t <-  data.frame(t(dat))

n <- 3 # how many consecutive columns in the groups of interest?
dat.t$groups <- as.character(rep(seq_len(ceiling(ncol(dat) / n)), each = n, length.out = ncol(dat)))

library(data.table)
DT <- data.table(dat.t)
setkey(DT, groups)
dat.av <- DT[, lapply(.SD,mean), by=groups]
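
For readers without data.table, the same transpose-then-group idea can be sketched in base R with rowsum(). This swaps in a different technique from the one above and only covers means (sums divided by group sizes); the names grp, sums and counts are made up for illustration.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
p <- 99      # how many columns?
n <- 3       # how many consecutive columns per group?
dat <- data.frame(matrix(rnorm(4 * p), ncol = p))
grp <- rep(seq_len(ceiling(p / n)), each = n, length.out = p)

# After transposing, each original column is a row; rowsum() adds rows per group
sums   <- rowsum(t(as.matrix(dat)), grp)  # groups x original rows
counts <- as.vector(table(grp))           # columns per group, same sorted order

# Divide each group's sums by its size, then transpose back to rows x groups
dat.av <- t(sums / counts)
dim(dat.av)
```

Unlike the data.table result, which has one row per group, dat.av here is oriented like the original data: one row per measurement row and one column per group.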

Thanks everyone for your quick and patient efforts!

There is a beautifully simple solution if you are interested in applying a function to each unique combination of columns, in what is known as combinatorics.

combinations <- combn(colnames(dat), 2, function(x) rowMeans(dat[x]))

To calculate statistics for every unique combination of three columns, etc., just change the 2 to a 3. rowMeans itself is vectorized, so each combination is computed without an explicit loop over rows, although combn still calls the function once per combination. If the order of the columns matters, you instead need a permutation algorithm designed to reproduce ordered sets, such as combinat::permn.
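
The matrix returned by combn has no column names, so it can be hard to tell which pair each column belongs to. One way to label it, sketched here on a small four-column data frame, is a second combn call that pastes the names together. The underscore separator and the name pair_means are arbitrary choices for illustration.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
dat <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16))

# Row means for every unordered pair of columns: a 16 x 6 matrix
pair_means <- combn(names(dat), 2, function(x) rowMeans(dat[x]))

# Label each column with the pair it came from, e.g. "a_b"
colnames(pair_means) <- combn(names(dat), 2, paste, collapse = "_")
head(pair_means)
```

Both calls enumerate the combinations in the same order, which is what makes the labels line up with the columns.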
