R: aggregate columns of a data.frame

I have a data.frame that looks like this:

> head(df)
            Memory    Memory    Memory    Memory    Memory     Naive     Naive
10472501  6.075714  5.898929  6.644946  6.023901  6.332126  8.087944  7.520194
10509163  6.168941  6.495393  5.951124  6.052527  6.404401  7.152890  8.335509
10496091 10.125575  9.966211 10.075613 10.310952 10.090649 11.803949 11.274480
10427035  6.644921  6.658567  6.569745  6.499243  6.990852  8.010784  7.798154
10503695  8.379494  8.153917  8.246484  8.390747  8.346748  9.540236  9.091740
10451763 10.986717 11.233819 10.643245 10.230697 10.541396 12.248487 11.823138  

I'd like to find the mean of the Memory columns and the mean of the Naive columns. The aggregate function aggregates rows. This data.frame could potentially have a large number of rows, so transposing it and then applying aggregate by the colnames of the original data.frame strikes me as bad, and is generally annoying:

> head(t(aggregate(t(df),list(colnames(df)), mean)))
         [,1]       [,2]      
Group.1  "Memory"   "Naive"   
10472501 "6.195123" "8.125439"
10509163 "6.214477" "7.733625"
10496091 "10.11380" "11.55348"
10427035 "6.672665" "8.266854"
10503695 "8.303478" "9.340436"

What's the blindingly obvious thing I'm missing?

I am a big advocate of reformatting data so that it's in a "long" format. The utility of the long format is especially evident when it comes to problems like this one. Fortunately, it's easy enough to reshape data like this into almost any format with the reshape package.

If I understood your question correctly, you want the mean of Memory and the mean of Naive for every row. For whatever reason, we need to make the column names unique for reshape::melt().

colnames(df) <- paste(colnames(df), 1:ncol(df), sep = "_")

Then you'll have to create an ID column. You could either do

df$ID <- 1:nrow(df)

or, if those rownames are meaningful,

df$ID <- rownames(df)

Now, with the reshape package:

library(reshape)
df.m <- melt(df, id = "ID")
df.m <- cbind(df.m, colsplit(df.m$variable, split = "_", names = c("Measure", "N")))
df.agg <- cast(df.m, ID ~ Measure, fun = mean)

df.agg should now look like your desired output snippet.
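For readers without the reshape package, the same long-format idea can be sketched in base R. This is only an illustration under stated assumptions: the toy data frame, its values, and the names toy, long, and agg are made up here and are not part of the original question.

```r
# Hypothetical toy data with repeated column names, as in the question
m <- matrix(c(1, 3, 2, 4, 10, 30, 20, 40), nrow = 2,
            dimnames = list(c("a", "b"),
                            c("Memory", "Memory", "Naive", "Naive")))
toy <- data.frame(m, check.names = FALSE)  # keep the duplicate names

# Build the long format by hand: one row per (ID, Measure, value)
long <- data.frame(
  ID      = rep(rownames(toy), times = ncol(toy)),
  Measure = rep(colnames(toy), each = nrow(toy)),
  value   = unlist(toy, use.names = FALSE)
)

# Mean of value for every ID/Measure combination
agg <- tapply(long$value, list(long$ID, long$Measure), mean)
print(agg)
```

Here agg is a matrix with one row per ID and one column per measure, which plays the same role as df.agg above.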

Or, if you want just the overall means across all the rows, Zack's suggestion will work. On the original df (before the columns were renamed and the ID column was added), something like

m <- colMeans(df)
tapply(m, colnames(df), mean)

You could get the same result, but formatted as a data frame, with

cast(df.m, . ~ Measure, fun = mean)
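As a small check of the colMeans/tapply idea, here is a sketch on made-up data; the toy values and the names toy and overall are assumptions for illustration only.

```r
# Hypothetical toy data with repeated column names
toy <- data.frame(matrix(c(1, 3, 2, 4, 10, 30, 20, 40), nrow = 2),
                  check.names = FALSE)
colnames(toy) <- c("Memory", "Memory", "Naive", "Naive")

m <- colMeans(toy)                         # one mean per column
overall <- tapply(m, colnames(toy), mean)  # average the column means per name
print(overall)
```

Note that averaging the column means matches the pooled mean of a group only when every column in that group has the same number of non-missing values.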

What about something like

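groups <- unique(colnames(df))
# drop = FALSE keeps a data.frame even if a group has only one column
l <- lapply(groups, function(x) rowMeans(df[, colnames(df) == x, drop = FALSE]))
names(l) <- groups  # label each result with its group name
df <- do.call(cbind.data.frame, l)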
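To see what this produces, here is a minimal worked sketch on made-up data. The toy values and the names toy, groups, and res are assumptions, and sapply is swapped in for lapply plus cbind only to get a named matrix in one step.

```r
# Hypothetical toy data with repeated column names
m <- matrix(c(1, 3, 2, 4, 10, 30, 20, 40), nrow = 2,
            dimnames = list(c("a", "b"),
                            c("Memory", "Memory", "Naive", "Naive")))
toy <- data.frame(m, check.names = FALSE)  # keep the duplicate names

groups <- unique(colnames(toy))
# For each distinct column name, take the row means over its columns
res <- sapply(groups, function(g) {
  rowMeans(toy[, colnames(toy) == g, drop = FALSE])
})
print(res)
```

The result has one row per original row and one column per distinct column name.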

To clarify Jonathan Chang's answer: the blindingly obvious thing you're missing is that you can just select the columns and call rowMeans. That gives a vector of the means for each row. His command gets the row means for each group of unique column names, and it was exactly what I was going to write. With your sample data, the result of his lapply call is a list of two vectors.

rowMeans is also very fast.

To break it down, getting the means of only your Memory columns is just

rowMeans(df[, colnames(df) == 'Memory'])  # or, from your example, rowMeans(df[, 1:5])

It's the simplest complete, correct answer; vote him up and mark him correct if you like it.

(BTW, I also liked Jo's recommendation to generally keep things as long data.)

m = matrix(1:12,3)
colnames(m) = c(1,1,2,2)

m

     1 1 2  2
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12

mt = t(m)
sapply(by(mt,rownames(mt),colMeans),identity)

     1    2
V1 2.5  8.5
V2 3.5  9.5
V3 4.5 10.5

I think you loaded your data without header=TRUE, so what you have is a matrix of factors, and that is why your generally good idea fails.
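A quick way to check for this is to look at the column classes after reading. The sketch below uses a made-up inline CSV; csv_text, bad, and good are hypothetical names. With header = FALSE the header row is read as data, so the columns come back non-numeric (factor in R before 4.0.0, character since).

```r
# Hypothetical inline CSV whose first line is the header
csv_text <- "Memory,Memory,Naive\n6.1,5.9,8.1\n6.2,6.5,7.2"

# Header row treated as data: every column becomes non-numeric
bad <- read.csv(text = csv_text, header = FALSE)

# Header row treated as names; check.names = FALSE keeps the duplicates
good <- read.csv(text = csv_text, header = TRUE, check.names = FALSE)

print(sapply(bad, class))
print(sapply(good, class))
```

If the classes are not numeric, functions such as rowMeans and aggregate with mean will not give the intended result, so re-reading the file with header = TRUE is the first thing to try.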
