
In R, how do I average rows of a data frame over groups without renaming each column?

I have a data frame with 68 columns of variables.

> dim(full_data)
[1] 10299    68

Example:
    F1 F2 M1 M2 M3 ... M66
    1  A  3  5  8      1
    1  B  4  1  2      5
    1  A  9  8  7      7

I need to average all the columns M1 to M66 by grouping on F1 and F2.

Most methods seem to look something like this: ddply(full_data, c("F1","F2"), summarise, MEAN=mean(M1)), where a new column, MEAN, is named and created explicitly. I don't want to do that for 66 columns. I would prefer the column names to just stay the same.

Example Result:
    F1 F2 M1  M2   M3 ... M66
    1  A  6  6.5  7.5      4
    1  B  4    1    2      5

Assuming your data set is called df:

## install.packages("data.table")
library(data.table)
setDT(df)[, lapply(.SD, mean), by = list(F1, F2)]
##    F1 F2 M1  M2  M3 M66
## 1:  1  A  6 6.5 7.5   4
## 2:  1  B  4 1.0 2.0   5

If your data set also has other columns and you want to average only M1:M66, you can restrict the operation with .SDcols:

setDT(df)[, lapply(.SD, mean), .SDcols = paste0("M", seq_len(66)), by = list(F1, F2)]

Or you could use dplyr (note that summarise_each() and funs() have since been deprecated in newer dplyr releases):

library(dplyr)
df %>%
  group_by(F1, F2) %>%
  summarise_each(funs(mean))
## Source: local data frame [2 x 6]
## Groups: F1
##     F1 F2 M1  M2  M3 M66
##  1:  1  A  6 6.5 7.5   4
##  2:  1  B  4 1.0 2.0   5
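
In dplyr 1.0.0 and later, across() covers the same use case. The following is a minimal sketch rather than part of the original answer; it assumes the measurement columns all start with "M" and rebuilds the small example df so that it runs on its own (res is just an illustrative name):

```r
library(dplyr)

# Toy data matching the example in the question
df <- data.frame(F1 = c(1L, 1L, 1L), F2 = c("A", "B", "A"),
                 M1 = c(3L, 4L, 9L), M2 = c(5L, 1L, 8L),
                 M3 = c(8L, 2L, 7L), M66 = c(1L, 5L, 7L))

# Average every column whose name starts with "M" within each F1/F2 group;
# the column names are kept as they are
res <- df %>%
  group_by(F1, F2) %>%
  summarise(across(starts_with("M"), mean), .groups = "drop")
```

Using starts_with("M") instead of everything() also leaves out any unrelated columns that happen to be in the data frame.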

Here's a base R solution which I suspect will be more efficient than aggregate or ddply

t(vapply(split(df[, -c(1:2)], df[, 1:2], drop = TRUE), colMeans, double(4))) # In your case it will be double(66)
##     M1  M2  M3 M66
## 1.A  6 6.5 7.5   4
## 1.B  4 1.0 2.0   5
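
To avoid hard-coding double(4) or double(66), the template length can be derived from the columns being averaged. This is only a sketch under the assumption that the grouping columns are named F1 and F2; m_cols and res are illustrative names:

```r
# Toy data matching the example in the question
df <- data.frame(F1 = c(1L, 1L, 1L), F2 = c("A", "B", "A"),
                 M1 = c(3L, 4L, 9L), M2 = c(5L, 1L, 8L),
                 M3 = c(8L, 2L, 7L), M66 = c(1L, 5L, 7L))

# Every column that is not a grouping column gets averaged
m_cols <- setdiff(names(df), c("F1", "F2"))

# Split the measurement columns by the F1/F2 combination, take column means
# per group, and transpose so that each group becomes one row
res <- t(vapply(split(df[m_cols], df[c("F1", "F2")], drop = TRUE),
                colMeans, numeric(length(m_cols))))
```

Note that the result is a numeric matrix whose row names ("1.A", "1.B") encode the groups, not a data frame with separate F1 and F2 columns.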

Or using base R

 aggregate(.~F1+F2, df, mean)
 #   F1 F2 M1  M2  M3 M66
 #1  1  A  6 6.5 7.5   4
 #2  1  B  4 1.0 2.0   5

Using ddply, you can apply the function column-wise with numcolwise:

 library(plyr)
 ddply(df, .(F1, F2), numcolwise(mean))
 #  F1 F2 M1  M2  M3 M66
 #1  1  A  6 6.5 7.5   4
 #2  1  B  4 1.0 2.0   5

data

df <- structure(list(F1 = c(1L, 1L, 1L), F2 = c("A", "B", "A"), M1 = c(3L, 
4L, 9L), M2 = c(5L, 1L, 8L), M3 = c(8L, 2L, 7L), M66 = c(1L, 
5L, 7L)), .Names = c("F1", "F2", "M1", "M2", "M3", "M66"), class = "data.frame", row.names = c(NA, 
-3L))

Using base functions, you could do

mydf <- data.frame(F1 = sample(c("a", "b", "c"), 100, replace = TRUE), 
                   F2 = sample(c("1", "2"), 100, replace = TRUE),
                   M1 = runif(100),
                   M2 = runif(100),
                   M3 = runif(100))

aggregate(. ~ F1 + F2, FUN = mean, data = mydf)

  F1 F2        M1        M2        M3
1  a  1 0.5787761 0.5044229 0.4641159
2  b  1 0.5427231 0.4923563 0.5289595
3  c  1 0.5145906 0.5709069 0.4812297
4  a  2 0.4161674 0.4815931 0.5127524
5  b  2 0.5018423 0.4337168 0.5563098
6  c  2 0.4326560 0.4749937 0.4575443

This averages every column other than F1 and F2. To include only specific M* columns, you could construct the formula explicitly, or subset the data.frame first, for example with grepl.
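
The subsetting idea could look roughly like this. It is a sketch that assumes the columns to average all start with "M"; the extra column other and the names keep and res are made up for illustration:

```r
set.seed(1)  # only to make the toy data reproducible
mydf <- data.frame(F1 = sample(c("a", "b", "c"), 100, replace = TRUE),
                   F2 = sample(c("1", "2"), 100, replace = TRUE),
                   M1 = runif(100),
                   M2 = runif(100),
                   other = runif(100))  # a column that should not be averaged

# Keep the grouping columns plus every column whose name starts with "M"
keep <- names(mydf) %in% c("F1", "F2") | grepl("^M", names(mydf))

# The dot in the formula now expands only to the retained M* columns
res <- aggregate(. ~ F1 + F2, data = mydf[keep], FUN = mean)
```

Subsetting before the call keeps the formula short even with 66 measurement columns, whereas an explicit formula would have to list each column inside cbind().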


 