
How to do rowSums over many columns in ``dplyr`` or ``tidyr``?

For example, is it possible to do this in dplyr:

new_name <- "Sepal.Sum"
col_grep <- "Sepal"

iris <- cbind(iris, tmp_name = rowSums(iris[,grep(col_grep, names(iris))]))
names(iris)[names(iris) == "tmp_name"] <- new_name

This adds up all the columns that contain "Sepal" in the name and creates a new variable named "Sepal.Sum".

Importantly, the solution needs to rely on a grep (or dplyr:::matches, dplyr:::one_of, etc.) when selecting the columns for the rowSums function, and the name of the new column must be dynamic.

My application has many new columns being created in a loop, so an even better solution would use mutate_each_ to generate many of these new columns.

Here is a dplyr solution that uses the contains() selection helper inside select().

 iris %>% mutate(Sepal.Sum = iris %>% rowwise() %>% select(contains("Sepal")) %>% rowSums()) -> iris2
 head(iris2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Sum
1          5.1         3.5          1.4         0.2  setosa       8.6
2          4.9         3.0          1.4         0.2  setosa       7.9
3          4.7         3.2          1.3         0.2  setosa       7.9
4          4.6         3.1          1.5         0.2  setosa       7.7
5          5.0         3.6          1.4         0.2  setosa       8.6
6          5.4         3.9          1.7         0.4  setosa       9.3

And here are the benchmarks:

Unit: milliseconds
                                                                                                      expr
 iris2 <- iris %>% mutate(Sepal.Sum = iris %>% rowwise() %>% select(contains("Sepal")) %>%      rowSums())
      min      lq     mean   median       uq      max neval
 1.816496 1.86304 2.132217 1.928748 2.509996 5.252626   100
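The answer above hard-codes the name Sepal.Sum, while the question asks for a dynamic name. A minimal sketch of one way to do that, assuming a dplyr version that supports tidy evaluation (0.7 or later) and the magrittr pipe; new_name and col_grep are taken from the question, and iris2 is just an illustrative result name. The rowwise() step is left out because rowSums() already works row by row:

```r
library(dplyr)

new_name <- "Sepal.Sum"  # name of the column to create (from the question)
col_grep <- "Sepal"      # regular expression used to pick the columns

# `.` is the data frame coming in through the pipe; matches() selects columns by regex.
# `!!new_name :=` injects the string stored in new_name as the new column's name.
iris2 <- iris %>%
  mutate(!!new_name := rowSums(select(., matches(col_grep))))

head(iris2)
```

Note that matches() is case-insensitive by default, whereas grep() is not, so the two only select the same columns when case does not matter.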

I didn't want to post this as a comment because it's too long.

There is not much between the proposed solutions in terms of timing (except the data.table solution, which appears slower), and none stands out as clearly more elegant.

library(dplyr)
library(data.table)
new_name <- "Sepal.Sum"
col_grep <- "Sepal"
# Make iris bigger
data(iris)
for(i in 1:18){
  iris <- bind_rows(iris, iris)
}
iris1 <- iris
system.time({ 
  # Base solution
  iris1 <- cbind(iris1, tmp_name = rowSums(iris1[,grep(col_grep, names(iris1))])) 
  names(iris1)[names(iris1) == "tmp_name"] <- new_name 
}) 
# 1.26

system.time({ 
  # less elegant dplyr solution
  iris %>% select(matches(col_grep)) %>% rowSums() %>% 
    data.frame(.) %>% bind_cols(iris, .) %>% setNames(., c(names(iris), new_name)) 
})
# 1.14

system.time({ 
  # bit more elegant dplyr solution
  iris %>% mutate(tmp_name = rowSums(.[] %>% select(matches(col_grep)))) %>% 
    rename_(.dots = setNames("tmp_name", new_name))
})
# 1.12

data(iris)
# Make iris bigger
for(i in 1:18){
  iris <- rbindlist(list(iris, iris))
}
system.time({
  setDT(iris)[, tmp_name := rowSums(.SD[,grep(col_grep, names(iris)), with = FALSE])]
  setnames(iris, "tmp_name", new_name)
})
# 2.39
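The question also mentions creating many such columns in a loop. A minimal sketch of how the "bit more elegant" dplyr approach above might be extended, assuming the same tidy-evaluation support in dplyr; sum_specs and iris_sums are hypothetical names introduced only for this illustration, and datasets::iris is used so the sketch starts from the original data rather than the enlarged copy:

```r
library(dplyr)

# Hypothetical lookup table: new column name -> regex selecting the columns to sum
sum_specs <- c(Sepal.Sum = "Sepal", Petal.Sum = "Petal")

iris_sums <- datasets::iris
for (new_name in names(sum_specs)) {
  col_grep <- sum_specs[[new_name]]
  # Add one row-sum column per entry, named dynamically via `!!new_name :=`
  iris_sums <- iris_sums %>%
    mutate(!!new_name := rowSums(select(., matches(col_grep))))
}

head(iris_sums)
```

One caveat with this design: each new column is added to the same data frame, so a pattern that also matches an earlier sum column (for example "Sum") would include that column in later sums.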
