
R aggregate by large number of columns

I have a data frame (df) with about 40 columns, and I want to aggregate 4 of those columns using a sum. Apart from the 4 columns I want to sum, each unique value in column 1 corresponds to identical values across the rest of the columns, and I want to keep all the columns in the aggregated data frame. Is there any way to specify the columns in the by = list() portion without having to type them all explicitly? For example, say I want to sum column "field" grouped by columns 1-36. I've tried

aggregate(df$field, by = list(df[,1:36]), FUN = sum)

but it throws an error since that isn't a list of names. I've also tried

aggregate(df$field, by = list(names(df)[1:36]), FUN = sum)

And while this doesn't give an error, it gives me back an aggregation with my df names as the unique observations.

Or am I missing an easy way to say "aggregate these four columns using the rest of the data frame?"

Thanks

Here's an example data frame:

  A B C D Sum
1 A B C D   1
2 A B C D   2
3 A B C D   3
4 E F 1 R   4
5 E F 1 R   5

After I aggregate I want it to look like:

  A B C D Sum
1 A B C D   6
2 E F 1 R   9

I know I can do this if I explicitly state x$A, x$B, x$C, x$D in the "by" portion of the aggregate statement, but in my actual data frame this would require explicitly typing about 40 field names.

You are asking how to aggregate the sum of multiple variables, grouped by the remaining variables. I would do this by combining the multiple variables first and then aggregating using the (in my opinion) more convenient formula interface of the aggregate function. For instance, consider aggregating the sum of Sepal.Length, Sepal.Width, and Petal.Length in the iris dataset based on the remaining variables (Petal.Width and Species):

agg <- iris
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
agg$sum <- rowSums(agg[,cols])
agg <- agg[,!names(agg) %in% cols]
aggregate(sum~., data=agg, FUN=sum)
#    Petal.Width    Species   sum
# 1          0.1     setosa  47.8
# 2          0.2     setosa 284.1
# 3          0.3     setosa  68.1
# 4          0.4     setosa  74.6
# 5          0.5     setosa  10.1
# 6          0.6     setosa  10.1
# 7          1.0 versicolor  79.9
# 8          1.1 versicolor  34.3
# 9          1.2 versicolor  63.8
# 10         1.3 versicolor 166.5
# 11         1.4 versicolor  96.7
# 12         1.5 versicolor 136.5
# 13         1.6 versicolor  42.0
# 14         1.7 versicolor  14.7
# 15         1.8 versicolor  13.9
# 16         1.4  virginica  14.3
# 17         1.5  virginica  27.4
# 18         1.6  virginica  16.0
# 19         1.7  virginica  11.9
# 20         1.8  virginica 162.2
# 21         1.9  virginica  71.7
# 22         2.0  virginica  91.3
# 23         2.1  virginica  94.4
# 24         2.2  virginica  48.3
# 25         2.3  virginica 125.6
# 26         2.4  virginica  44.4
# 27         2.5  virginica  48.2

Using the example data from @josilber's answer, here is another way to get the desired output with the dplyr package, which tends to be faster on large datasets:

library('dplyr')

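# The original answer used regroup() and summarise_each(funs(sum)); regroup() is
# no longer in dplyr and summarise_each()/funs() are deprecated.
# Equivalent with group_by(across()), which needs dplyr >= 1.0:
out = agg %>%
  group_by(across(all_of(setdiff(names(agg), "sum")))) %>%  # group by every column except "sum"
  summarise(sum = sum(sum))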

Source: local data frame [27 x 3]
Groups: Species

#  Species Petal.Width   sum
#1      setosa         0.1  47.8
#2      setosa         0.2 284.1
#3      setosa         0.3  68.1
#4      setosa         0.4  74.6
#5      setosa         0.5  10.1
#6      setosa         0.6  10.1
#7  versicolor         1.0  79.9
#8  versicolor         1.1  34.3
#9  versicolor         1.2  63.8
#10 versicolor         1.3 166.5
#..        ...         ...   ...

Using data.table:

library('data.table')

# group by every column except "sum"
out = setDT(agg)[, list(sum = sum(sum)), by = setdiff(names(agg), "sum")]

#  Species Petal.Width   sum
#1:     setosa         0.2 284.1
#2:     setosa         0.4  74.6
#3:     setosa         0.3  68.1
#4:     setosa         0.1  47.8
#5:     setosa         0.5  10.1
#6:     setosa         0.6  10.1
#7: versicolor         1.4  96.7
#8: versicolor         1.5 136.5
#9: versicolor         1.3 166.5
#10:versicolor         1.6  42.0
# ...

Use the data.frame method (aggregate.data.frame) like this:

aggregate(df["field"], by = df[1:36], FUN = sum)

or use the formula method (aggregate.formula) like this:

nms <- c("field", names(df)[1:36])
# subset to the summed column plus the grouping columns so that "." expands to exactly those
aggregate(field ~ ., df[nms], sum)
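
The reason by = df[1:36] works where by = list(df[,1:36]) does not is probably worth spelling out: a data frame is already a list with one element per column, so it can be passed to by directly, while wrapping it in list() produces a one-element list holding the whole data frame. A minimal sketch on a small made-up data frame (d, g1 and g2 are hypothetical names standing in for the real 40-column data):

```r
# Hypothetical stand-in for the real data: two grouping columns and one column to sum
d <- data.frame(g1 = c("a", "a", "b"),
                g2 = c("x", "x", "y"),
                field = 1:3)

is.list(d[1:2])         # TRUE: a data frame is itself a list of columns
length(d[1:2])          # 2: one grouping vector per column
length(list(d[, 1:2]))  # 1: a single element holding the whole data frame

# Passing the data frame subset straight to `by` groups by each of its columns
res <- aggregate(d["field"], by = d[1:2], FUN = sum)
```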

In terms of the example data at the end of the question:

Lines <- " A B C D Sum
1 A B C D   1
2 A B C D   2
3 A B C D   3
4 E F 1 R   4
5 E F 1 R   5"
df <- read.table(text = Lines, header = TRUE)

# data.frame method
aggregate(df["Sum"], df[1:4], sum)

# data.frame method - alternative
aggregate(df[5], df[-5], sum)

# formula method
aggregate(Sum ~., df, sum)
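
The question mentions summing four columns rather than one. Both methods should extend to that case; the sketch below uses two summed columns for brevity. The data frame df2 and the column names Sum1 and Sum2 are made up for illustration and are not from the question.

```r
# Hypothetical data: the question's four grouping columns plus two columns to sum
df2 <- data.frame(
  A = c("A", "A", "A", "E", "E"),
  B = c("B", "B", "B", "F", "F"),
  C = c("C", "C", "C", "1", "1"),
  D = c("D", "D", "D", "R", "R"),
  Sum1 = 1:5,
  Sum2 = c(10, 20, 30, 40, 50)
)

sum_cols <- c("Sum1", "Sum2")               # columns to aggregate
by_cols  <- setdiff(names(df2), sum_cols)   # every other column is a grouping column

# data.frame method: select both sets of columns by name
res_df <- aggregate(df2[sum_cols], by = df2[by_cols], FUN = sum)

# formula method: cbind() the summed columns on the left-hand side,
# "." on the right-hand side stands for all remaining columns
res_fm <- aggregate(cbind(Sum1, Sum2) ~ ., data = df2, FUN = sum)
```

With this approach only the handful of summed columns has to be typed out; the grouping columns are derived from names(df2).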

This would be the current answer with dplyr:

library('dplyr')
mytb<-read.table(text="
A B C D Sum
1 A B C D   1
2 A B C D   2
3 A B C D   3
4 E F 1 R   4
5 E F 1 R   5", header=T, stringsAsFactors=F)

mytb %>% 
  group_by_at(names(select(mytb, -"Sum") ) )  %>% 
  summarise_all(.funs=sum)
