简体   繁体   中英

Aggregate by string column name in R

I would like to group data in a data.frame by two columns and then sum a specific third column. For example:

> aggregate(mpg~gear+cyl, data=mtcars, FUN=sum)
  gear cyl   mpg
1    3   4  21.5
2    4   4 215.4
3    5   4  56.4
4    3   6  39.5
5    4   6  79.0
6    5   6  19.7
7    3   8 180.6
8    5   8  30.8

Now, I need to do this several times for different columns. So I would like to write a function which generalizes this. It take the data.frame and one of the columns (to keep things simple) and does the same thing.

agg.data <- function(df, colname) {
  aggregate(mpg~gear+colname, data=df, FUN=sum) 
}

Running this will produce:

Error in eval(expr, envir, enclos) : object 'colname' not found

How can I pass in the value of colname to aggregate?

Paste together a string representation of your formula, and give that string as an argument to formula()...

agg.data <- function(df, colname) {
  aggregate(formula(paste0("mpg~gear+", colname)), data=df, FUN=sum) 
}

> agg.data(mtcars, "cyl")
  gear cyl   mpg
1    3   4  21.5
2    4   4 215.4
3    5   4  56.4
4    3   6  39.5
5    4   6  79.0
6    5   6  19.7
7    3   8 180.6
8    5   8  30.8

Using data.table :

fun.dt <- function(dt, col) {
    dt[, .(mpg=sum(mpg)), by=c("gear", col)]
}

require(data.table)
dt = as.data.table(mtcars)
fun.dt(dt, "cyl")
#    gear cyl   mpg
# 1:    4   6  79.0
# 2:    4   4 215.4
# 3:    3   6  39.5
# 4:    3   8 180.6
# 5:    3   4  21.5
# 6:    5   4  56.4
# 7:    5   8  30.8
# 8:    5   6  19.7

The by expression in data.tables can also take a character vector of column names in addition to lists of columns/expressions. We can simply provide a character vector to the by argument.

You can easily use the "normal" aggregate interface (ie not the formula interface) to supply column names in variables. The syntax is slightly different but still easy enough and doesn't require pasting:

agg.data2 <- function(df, colname) {
  aggregate(df[["mpg"]], list(df[["gear"]], df[[colname]]), FUN=sum) 
}
agg.data2(mtcars, "cyl")
#  Group.1 Group.2     x
#1       3       4  21.5
#2       4       4 215.4
#3       5       4  56.4
#4       3       6  39.5
#5       4       6  79.0
#6       5       6  19.7
#7       3       8 180.6
#8       5       8  30.8

Here's the dplyr equivalent:

library(dplyr)
agg.data.dplyr <- function(df, colname) {
  df %>%
    group_by_(.dots = c("gear", colname)) %>%
    summarise(sum = sum(mpg)) %>%
    ungroup()
}
agg.data.dplyr(mtcars, "cyl")

You can also pass an unquoted column name using deparse and substitute

agg.data <- function(df, colname) {
  aggregate(df$mpg, list(df$gear, df[, deparse(substitute(colname))]), FUN=sum) 
}

agg.data(mtcars, cyl)
#   Group.1 Group.2     x
# 1       3       4  21.5
# 2       4       4 215.4
# 3       5       4  56.4
# 4       3       6  39.5
# 5       4       6  79.0
# 6       5       6  19.7
# 7       3       8 180.6
# 8       5       8  30.8

You can also do this in the style of ggplot or with that allows you to just write the colnames as they are without passing a string by using substitute .

agg.data3 = function (df, colname){
    colname = substitute(colname)
    colname = as.character(colname)
    aggregate(formula(paste0("mpg~gear+", colname)), data=mtcars, FUN=sum)
}

usage

agg.data3(cars, cyl)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM