Aggregate by string column name in R

I would like to group the rows of a data.frame by two columns and then sum a third column. For example:

> aggregate(mpg~gear+cyl, data=mtcars, FUN=sum)
  gear cyl   mpg
1    3   4  21.5
2    4   4 215.4
3    5   4  56.4
4    3   6  39.5
5    4   6  79.0
6    5   6  19.7
7    3   8 180.6
8    5   8  30.8

Now I need to do this several times for different columns, so I would like to write a function that generalizes it. It takes the data.frame and the name of one of the grouping columns (to keep things simple) and does the same thing.

agg.data <- function(df, colname) {
  aggregate(mpg~gear+colname, data=df, FUN=sum) 
}

Running this produces:

Error in eval(expr, envir, enclos) : object 'colname' not found

How can I pass the value of colname to aggregate?

Paste together a string representation of your formula and pass that string to formula():

agg.data <- function(df, colname) {
  aggregate(formula(paste0("mpg~gear+", colname)), data=df, FUN=sum) 
}

> agg.data(mtcars, "cyl")
  gear cyl   mpg
1    3   4  21.5
2    4   4 215.4
3    5   4  56.4
4    3   6  39.5
5    4   6  79.0
6    5   6  19.7
7    3   8 180.6
8    5   8  30.8
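
As a variation on the pasted formula above, base R's reformulate() can assemble the formula from character vectors. This is only a sketch: the function name agg.data.rf and the value argument are hypothetical additions, not part of the original answer.

```r
# Sketch: build the formula from strings with stats::reformulate().
# `agg.data.rf` and `value` are hypothetical names used for illustration.
agg.data.rf <- function(df, colname, value = "mpg") {
  # reformulate(c("gear", "cyl"), response = "mpg") yields mpg ~ gear + cyl
  f <- reformulate(c("gear", colname), response = value)
  aggregate(f, data = df, FUN = sum)
}

res <- agg.data.rf(mtcars, "cyl")
```

Because the response is also a string here, the same helper could sum a different column, e.g. agg.data.rf(mtcars, "cyl", value = "hp").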

Using data.table:

fun.dt <- function(dt, col) {
    dt[, .(mpg=sum(mpg)), by=c("gear", col)]
}

require(data.table)
dt = as.data.table(mtcars)
fun.dt(dt, "cyl")
#    gear cyl   mpg
# 1:    4   6  79.0
# 2:    4   4 215.4
# 3:    3   6  39.5
# 4:    3   8 180.6
# 5:    3   4  21.5
# 6:    5   4  56.4
# 7:    5   8  30.8
# 8:    5   6  19.7

In addition to lists of columns or expressions, the by argument of a data.table also accepts a character vector of column names, so we can simply pass one.
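
If the summed column should also be supplied as a string, one possible extension is to combine the character by with .SDcols. This is a sketch under assumptions: fun.dt2 and its value argument are hypothetical names, not from the original answer.

```r
library(data.table)

# Hypothetical helper: both the grouping column and the summed column
# are given as strings.
fun.dt2 <- function(dt, col, value = "mpg") {
  # .SDcols restricts .SD to the column(s) named in `value`;
  # lapply(.SD, sum) then sums each of them within every group.
  dt[, lapply(.SD, sum), by = c("gear", col), .SDcols = value]
}

dt <- as.data.table(mtcars)
res <- fun.dt2(dt, "cyl", value = "hp")
```

The result keeps the original column names (gear, cyl, hp), with one row per group.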

You can use the "normal" aggregate interface (i.e. not the formula interface) to supply column names held in variables. The syntax is slightly different, but still easy enough, and it doesn't require pasting:

agg.data2 <- function(df, colname) {
  aggregate(df[["mpg"]], list(df[["gear"]], df[[colname]]), FUN=sum) 
}
agg.data2(mtcars, "cyl")
#  Group.1 Group.2     x
#1       3       4  21.5
#2       4       4 215.4
#3       5       4  56.4
#4       3       6  39.5
#5       4       6  79.0
#6       5       6  19.7
#7       3       8 180.6
#8       5       8  30.8
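
The output above carries the generic names Group.1, Group.2 and x. A small variation, sketched here with the hypothetical name agg.data2.named, passes data frames instead of bare vectors so that aggregate can keep the original column names.

```r
# Sketch: df["mpg"] and df[c("gear", colname)] are data frames (named lists),
# so aggregate() carries their column names into the result.
agg.data2.named <- function(df, colname) {
  aggregate(df["mpg"], by = df[c("gear", colname)], FUN = sum)
}

res <- agg.data2.named(mtcars, "cyl")
names(res)
```

Here the result columns are named gear, cyl and mpg rather than Group.1, Group.2 and x.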

Here's the dplyr equivalent:

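library(dplyr)
agg.data.dplyr <- function(df, colname) {
  df %>%
    # group_by_() is deprecated; with dplyr >= 1.0.0,
    # across(all_of()) selects the grouping columns by name
    group_by(across(all_of(c("gear", colname)))) %>%
    summarise(sum = sum(mpg)) %>%
    ungroup()
}
agg.data.dplyr(mtcars, "cyl")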

You can also pass an unquoted column name using deparse and substitute:

agg.data <- function(df, colname) {
  aggregate(df$mpg, list(df$gear, df[, deparse(substitute(colname))]), FUN=sum) 
}

agg.data(mtcars, cyl)
#   Group.1 Group.2     x
# 1       3       4  21.5
# 2       4       4 215.4
# 3       5       4  56.4
# 4       3       6  39.5
# 5       4       6  79.0
# 6       5       6  19.7
# 7       3       8 180.6
# 8       5       8  30.8
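
One caveat with this approach, illustrated by the sketch below: substitute() captures the expression exactly as the caller typed it, so the function may not behave as expected when it is called from another function that holds the column name in a variable. The names agg.data.nse and wrapper are hypothetical and only serve the illustration.

```r
# Hypothetical helper using the same deparse(substitute()) idea.
agg.data.nse <- function(df, colname) {
  name <- deparse(substitute(colname))
  aggregate(df["mpg"], by = df[c("gear", name)], FUN = sum)
}

# Direct call: `name` becomes "cyl", so this works.
direct <- agg.data.nse(mtcars, cyl)

# Indirect call: substitute() sees the symbol `col`, not the string "cyl",
# so the column lookup fails with an error.
wrapper <- function(col) agg.data.nse(mtcars, col)
res <- tryCatch(wrapper("cyl"), error = function(e) "error")
```

For functions that will be called programmatically, the string-based versions shown earlier avoid this problem.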

You can also do this in the style of ggplot or with(), which lets you write the column name as it is, without passing a string, by using substitute:

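agg.data3 = function (df, colname){
    # capture the unquoted column name and turn it into a string
    colname = substitute(colname)
    colname = as.character(colname)
    # use the df argument rather than a hard-coded data set
    aggregate(formula(paste0("mpg~gear+", colname)), data=df, FUN=sum)
}

Usage:

agg.data3(mtcars, cyl)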
