I have a data frame (df) that has about 40 columns, and I want to aggregate using a sum on 4 of the columns. Outside of the 4 I want to sum, each unique value in column 1 corresponds to identical values across the rest of the columns, and I want to keep all the columns in the aggregated data frame. Is there any way I can specify the columns in the by = list() portion without having to type them all explicitly? For example, if I knew I wanted to sum column "field" by columns 1-36. I've tried
aggregate(df$field, by = list(df[,1:36]), FUN = sum)
but it throws an error since that isn't a list of names. I've also tried
aggregate(df$field, by = list(names(df)[1:36]), FUN = sum)
And while this doesn't give an error, it gives me back an aggregation with my df names as the unique observations.
Or am I missing an easy way to say "aggregate these four columns using the rest of the data frame?"
Thanks
Here's an example data frame:
A B C D Sum
1 A B C D 1
2 A B C D 2
3 A B C D 3
4 E F 1 R 4
5 E F 1 R 5
After I aggregate I want it to look like:
A B C D Sum
1 A B C D 6
2 E F 1 R 9
I know I can do this if I explicitly state x$A, x$B, x$C, x$D in the "by" portion of the aggregate statement, but in my actual data frame this would require explicitly typing about 40 field names.
You are asking how to aggregate the sum of multiple variables, grouped by the remaining variables. I would do this by combining the multiple variables first and then aggregating using the (in my opinion) more convenient formula interface of the aggregate
function. For instance, consider aggregating the sum of Sepal.Length, Sepal.Width, and Petal.Length in the iris dataset based on the remaining variables (Petal.Width and Species):
agg <- iris
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
agg$sum <- rowSums(agg[,cols])
agg <- agg[,!names(agg) %in% cols]
aggregate(sum~., data=agg, FUN=sum)
# Petal.Width Species sum
# 1 0.1 setosa 47.8
# 2 0.2 setosa 284.1
# 3 0.3 setosa 68.1
# 4 0.4 setosa 74.6
# 5 0.5 setosa 10.1
# 6 0.6 setosa 10.1
# 7 1.0 versicolor 79.9
# 8 1.1 versicolor 34.3
# 9 1.2 versicolor 63.8
# 10 1.3 versicolor 166.5
# 11 1.4 versicolor 96.7
# 12 1.5 versicolor 136.5
# 13 1.6 versicolor 42.0
# 14 1.7 versicolor 14.7
# 15 1.8 versicolor 13.9
# 16 1.4 virginica 14.3
# 17 1.5 virginica 27.4
# 18 1.6 virginica 16.0
# 19 1.7 virginica 11.9
# 20 1.8 virginica 162.2
# 21 1.9 virginica 71.7
# 22 2.0 virginica 91.3
# 23 2.1 virginica 94.4
# 24 2.2 virginica 48.3
# 25 2.3 virginica 125.6
# 26 2.4 virginica 44.4
# 27 2.5 virginica 48.2
Using the example data mentioned by @josilber, this would be another option to achieve the desired output using dplyr()
which is more efficient for huge datasets
library('dplyr')
out = agg %>%
regroup(lapply(names(select(agg, -sum)), as.symbol)) %>%
summarise_each(funs(sum))
Source: local data frame [27 x 3]
Groups: Species
# Species Petal.Width sum
#1 setosa 0.1 47.8
#2 setosa 0.2 284.1
#3 setosa 0.3 68.1
#4 setosa 0.4 74.6
#5 setosa 0.5 10.1
#6 setosa 0.6 10.1
#7 versicolor 1.0 79.9
#8 versicolor 1.1 34.3
#9 versicolor 1.2 63.8
#10 versicolor 1.3 166.5
#.. ... ... ...
using data.table
library('data.table')
out = setDT(agg)[, list(sum = sum(sum)), by= names(agg[,!"sum", with=FALSE])]
# Species Petal.Width sum
#1: setosa 0.2 284.1
#2: setosa 0.4 74.6
#3: setosa 0.3 68.1
#4: setosa 0.1 47.8
#5: setosa 0.5 10.1
#6: setosa 0.6 10.1
#7: versicolor 1.4 96.7
#8: versicolor 1.5 136.5
#9: versicolor 1.3 166.5
#10:versicolor 1.6 42.0
# ...
Use the data.frame method ( aggregate.data.frame
) like this:
aggregate(df["field"], by = df[1:36], FUN = sum)
or use the formula method ( aggregate.formula
) like this:
nms <- c("field", names(df)[1:36])
aggregate(field ~., df, sum)
In terms of the example data at the end of the question:
Lines <- " A B C D Sum
1 A B C D 1
2 A B C D 2
3 A B C D 3
4 E F 1 R 4
5 E F 1 R 5"
df <- read.table(text = Lines, header = TRUE)
# data.frame method
aggregate(df["Sum"], df[1:4], sum)
# data.frame method - alternative
aggregate(df[5], df[-5], sum)
# formula method
aggregate(Sum ~., df, sum)
This would be the current answer with dplyr:
library('dplyr')
mytb<-read.table(text="
A B C D Sum
1 A B C D 1
2 A B C D 2
3 A B C D 3
4 E F 1 R 4
5 E F 1 R 5", header=T, stringsAsFactors=F)
mytb %>%
group_by_at(names(select(mytb, -"Sum") ) ) %>%
summarise_all(.funs=sum)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.