简体   繁体   中英

Sum across multiple columns with dplyr

My question involves summing up values across multiple columns of a data frame and creating a new column corresponding to this summation using dplyr . The data entries in the columns are binary(0,1). I am thinking of a row-wise analog of the summarise_each or mutate_each function of dplyr . Below is a minimal example of the data frame:

library(dplyr)
df=data.frame(
  x1=c(1,0,0,NA,0,1,1,NA,0,1),
  x2=c(1,1,NA,1,1,0,NA,NA,0,1),
  x3=c(0,1,0,1,1,0,NA,NA,0,1),
  x4=c(1,0,NA,1,0,0,NA,0,0,1),
  x5=c(1,1,NA,1,1,1,NA,1,0,1))

> df
   x1 x2 x3 x4 x5
1   1  1  0  1  1
2   0  1  1  0  1
3   0 NA  0 NA NA
4  NA  1  1  1  1
5   0  1  1  0  1
6   1  0  0  0  1
7   1 NA NA NA NA
8  NA NA NA  0  1
9   0  0  0  0  0
10  1  1  1  1  1

I could use something like:

df <- df %>% mutate(sumrow= x1 + x2 + x3 + x4 + x5)

but this would involve writing out the names of each of the columns. I have like 50 columns. In addition, the column names change at different iterations of the loop in which I want to implement this operation so I would like to try avoid having to give any column names.

How can I do that most efficiently? Any assistance would be greatly appreciated.

dplyr >= 1.0.0 using across

sum up each row using rowSums ( rowwise works for any aggreation, but is slower)

df %>%
   replace(is.na(.), 0) %>%
   mutate(sum = rowSums(across(where(is.numeric))))

sum down each column

df %>%
   summarise(across(everything(), ~ sum(., is.na(.), 0)))

dplyr < 1.0.0

sum up each row

df %>%
   replace(is.na(.), 0) %>%
   mutate(sum = rowSums(.[1:5]))

sum down each column using superseeded summarise_all :

df %>%
   replace(is.na(.), 0) %>%
   summarise_all(funs(sum))

If you want to sum certain columns only, I'd use something like this:

library(dplyr)
df=data.frame(
  x1=c(1,0,0,NA,0,1,1,NA,0,1),
  x2=c(1,1,NA,1,1,0,NA,NA,0,1),
  x3=c(0,1,0,1,1,0,NA,NA,0,1),
  x4=c(1,0,NA,1,0,0,NA,0,0,1),
  x5=c(1,1,NA,1,1,1,NA,1,0,1))
df %>% select(x3:x5) %>% rowSums(na.rm=TRUE) -> df$x3x5.total
head(df)

This way you can use dplyr::select 's syntax.

I would use regular expression matching to sum over variables with certain pattern names. For example:

df <- df %>% mutate(sum1 = rowSums(.[grep("x[3-5]", names(.))], na.rm = TRUE),
                    sum_all = rowSums(.[grep("x", names(.))], na.rm = TRUE))

This way you can create more than one variable as a sum of certain group of variables of your data frame.

Using reduce() from purrr is slightly faster than rowSums and definately faster than apply , since you avoid iterating over all the rows and just take advantage of the vectorized operations:

library(purrr)
library(dplyr)
iris %>% mutate(Petal = reduce(select(., starts_with("Petal")), `+`))

See this for timings

dplyr >= 1.0.0

In newer versions of dplyr you can use rowwise() along with c_across to perform row-wise aggregation for functions that do not have specific row-wise variants, but if the row-wise variant exists it should be faster.

Since rowwise() is just a special form of grouping and changes the way verbs work you'll likely want to pipe it to ungroup() after doing your row-wise operation.

To select a range by name :

df %>%
  rowwise() %>% 
  mutate(sumrange = sum(c_across(x1:x5), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()

To select by type :

df %>%
  rowwise() %>% 
  mutate(sumnumeric = sum(c_across(where(is.numeric)), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()

To select by column name :

You can use any number of tidy selection helpers like starts_with , ends_with , contains , etc.

df %>%
    rowwise() %>% 
    mutate(sum_startswithx = sum(c_across(starts_with("x")), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()

To select by column index :

df %>% 
  rowwise() %>% 
  mutate(sumindex = sum(c_across(c(1:4, 5)), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()

rowise() will work for any summary function . However, in your specific case a row-wise variant exists ( rowSums ) so you can do the following (note the use of across instead), which will be faster:

df %>%
  mutate(sumrow = rowSums(across(x1:x5), na.rm = T))

For more information see the page on rowwise .


Benchmarking

For this example, the the row-wise variant rowSums takes about half as much time:

library(microbenchmark)

microbenchmark(
  df %>%
    dplyr::rowwise() %>% 
    dplyr::mutate(sumrange = sum(dplyr::c_across(x1:x5), na.rm = T)),
  df %>%
    dplyr::mutate(sumrow = rowSums(dplyr::across(x1:x5), na.rm = T)),
  times = 1000L
)

    min    lq     mean  median      uq     max neval cld
 5.5256 6.256 7.024232 6.58885 7.02325 22.1911  1000   b
 2.7011 3.112 3.661106 3.41070 3.71975 32.6282  1000  a 

c_across versus across

In the particular case of the sum function, across and c_across give the same output for much of the code above:

sum_across <- df %>%
    rowwise() %>% 
    mutate(sumrange = sum(across(x1:x5), na.rm = T))

sum_c_across <- df %>%
    rowwise() %>% 
    mutate(sumrange = sum(c_across(x1:x5), na.rm = T)

all.equal(sum_across, sum_c_across)
[1] TRUE

The row-wise output of c_across is a vector (hence the c_ ), while the row-wise output of across is a 1-row tibble object:

df %>% 
  rowwise() %>% 
  mutate(c_across = list(c_across(x1:x5)),
         across = list(across(x1:x5)),
         .keep = "unused") %>% 
  ungroup() 

# A tibble: 10 x 2
   c_across  across          
   <list>    <list>          
 1 <dbl [5]> <tibble [1 x 5]>
 2 <dbl [5]> <tibble [1 x 5]>
 3 <dbl [5]> <tibble [1 x 5]>
 4 <dbl [5]> <tibble [1 x 5]>
 5 <dbl [5]> <tibble [1 x 5]>
 6 <dbl [5]> <tibble [1 x 5]>
 7 <dbl [5]> <tibble [1 x 5]>
 8 <dbl [5]> <tibble [1 x 5]>
 9 <dbl [5]> <tibble [1 x 5]>
10 <dbl [5]> <tibble [1 x 5]>

The function you want to apply will necessitate, which verb you use. As shown above with sum you can use them nearly interchangeably. However, mean and many other common functions expect a (numeric) vector as its first argument:

class(df[1,])
"data.frame"

sum(df[1,]) # works with data.frame
[1] 4

mean(df[1,]) # does not work with data.frame
[1] NA
Warning message:
In mean.default(df[1, ]) : argument is not numeric or logical: returning NA
class(unname(unlist(df[1,])))
"numeric"

sum(unname(unlist(df[1,]))) # works with numeric vector
[1] 4

mean(unname(unlist(df[1,]))) # works with numeric vector
[1] 0.8

Ignoring the row-wise variant that exists for mean ( rowMean ) then in this case c_across should be used:

df %>% 
  rowwise() %>% 
  mutate(avg = mean(c_across(x1:x5), na.rm = T)) %>% 
  ungroup()

# A tibble: 10 x 6
      x1    x2    x3    x4    x5   avg
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1     1     0     1     1   0.8
 2     0     1     1     0     1   0.6
 3     0    NA     0    NA    NA   0  
 4    NA     1     1     1     1   1  
 5     0     1     1     0     1   0.6
 6     1     0     0     0     1   0.4
 7     1    NA    NA    NA    NA   1  
 8    NA    NA    NA     0     1   0.5
 9     0     0     0     0     0   0  
10     1     1     1     1     1   1  

# Does not work
df %>% 
  rowwise() %>% 
  mutate(avg = mean(across(x1:x5), na.rm = T)) %>% 
  ungroup()

rowSums , rowMeans , etc. can take a numeric data frame as the first argument, which is why they work with across .

I encounter this problem often, and the easiest way to do this is to use the apply() function within a mutate command.

library(tidyverse)
df=data.frame(
  x1=c(1,0,0,NA,0,1,1,NA,0,1),
  x2=c(1,1,NA,1,1,0,NA,NA,0,1),
  x3=c(0,1,0,1,1,0,NA,NA,0,1),
  x4=c(1,0,NA,1,0,0,NA,0,0,1),
  x5=c(1,1,NA,1,1,1,NA,1,0,1))

df %>%
  mutate(sum = select(., x1:x5) %>% apply(1, sum, na.rm=TRUE))

Here you could use whatever you want to select the columns using the standard dplyr tricks (eg starts_with() or contains() ). By doing all the work within a single mutate command, this action can occur anywhere within a dplyr stream of processing steps. Finally, by using the apply() function, you have the flexibility to use whatever summary you need, including your own purpose built summarization function.

Alternatively, if the idea of using a non-tidyverse function is unappealing, then you could gather up the columns, summarize them and finally join the result back to the original data frame.

df <- df %>% mutate( id = 1:n() )   # Need some ID column for this to work

df <- df %>%
  group_by(id) %>%
  gather('Key', 'value', starts_with('x')) %>%
  summarise( Key.Sum = sum(value) ) %>%
  left_join( df, . )

Here I used the starts_with() function to select the columns and calculated the sum and you can do whatever you want with NA values. The downside to this approach is that while it is pretty flexible, it doesn't really fit into a dplyr stream of data cleaning steps.

Benchmarking (almost) all options to sum across columns

As it's difficult to decide among all the interesting answers given by @skd, @LMc, and others, I benchmarked all alternatives which are reasonably long.

The difference to other examples is that I used a larger dataset (10.000 rows) and from a real world dataset (diamonds), so the findings might reflect more the variance of real world data.

The reproducible benchmarking code is:

set.seed(17)
dataset <- diamonds %>% sample_n(1e4)
cols <- c("depth", "table", "x", "y", "z")

sum.explicit <- function() {
  dataset %>%
    mutate(sum.cols = depth + table + x + y + z)
}

sum.rowSums <- function() {
  dataset %>%
    mutate(sum.cols = rowSums(across(cols)))
}

sum.reduce <- function() {
  dataset %>%
    mutate(sum.cols = purrr::reduce(select(., cols), `+`))
}

sum.nest <- function() {
  dataset %>%
  group_by(id = row_number()) %>%
  nest(data = cols) %>%
  mutate(sum.cols = map_dbl(data, sum))
}

# NOTE: across with rowwise doesn't work with all functions!
sum.across <- function() {
  dataset %>%
    rowwise() %>%
    mutate(sum.cols = sum(across(cols)))
}

sum.c_across <- function() {
  dataset %>%
  rowwise() %>%
  mutate(sum.cols = sum(c_across(cols)))
}

sum.apply <- function() {
  dataset %>%
    mutate(sum.cols = select(., cols) %>%
             apply(1, sum, na.rm = TRUE))
}

bench <- microbenchmark::microbenchmark(
  sum.nest(),
  sum.across(),
  sum.c_across(),
  sum.apply(),
  sum.explicit(),
  sum.reduce(),
  sum.rowSums(),
  times = 10
)

bench %>% print(order = 'mean', signif = 3)
Unit: microseconds
           expr     min      lq    mean  median      uq     max neval
 sum.explicit()     796     839    1160     950    1040    3160    10
  sum.rowSums()    1430    1450    1770    1650    1800    2980    10
   sum.reduce()    1650    1700    2090    2000    2140    3300    10
    sum.apply()    9290    9400    9720    9620    9840   11000    10
 sum.c_across()  341000  348000  353000  356000  359000  360000    10
     sum.nest()  793000  827000  854000  843000  871000  945000    10
   sum.across() 4810000 4830000 4880000 4900000 4920000 4940000    10

Visualizing this (without the outlier sum.across ) facilitates the comparison:

在此处输入图像描述

Conclusion (subjective!)

  1. Despite great readability, nest and rowwise / c_across are not recommendable for larger datasets (> 100.000 rows or repeated actions)
  2. The explicit sum wins because it leverages internally the best the vectorization of the sum function, which is also leveraged by the rowSums but with a little computational overhead
  3. The purrr::reduce is relatively new in the tidyverse (but well known in python), and as Reduce in base R very efficient, thus winning a place among the Top3. Because the explicit form is cumbersome to write, and there are not many vectorized methods other than rowSums / rowMeans , colSums / colMeans , I would recommend for all other functions (eg sd ) to apply purrr::reduce .

In case you want to sum across columns or rows using a vector but in this case modifying the df instead of add a new column to df.

You can use the sweep function:

library(dplyr)
df=data.frame(
  x1=c(1,0,0,NA,0,1,1,NA,0,1),
  x2=c(1,1,NA,1,1,0,NA,NA,0,1),
  x3=c(0,1,0,1,1,0,NA,NA,0,1),
  x4=c(1,0,NA,1,0,0,NA,0,0,1),
  x5=c(1,1,NA,1,1,1,NA,1,0,1))
> df
   x1 x2 x3 x4 x5
1   1  1  0  1  1
2   0  1  1  0  1
3   0 NA  0 NA NA
4  NA  1  1  1  1
5   0  1  1  0  1
6   1  0  0  0  1
7   1 NA NA NA NA
8  NA NA NA  0  1
9   0  0  0  0  0
10  1  1  1  1  1

Sum (vector + dataframe) in row-wise order:

vector = 1:5
sweep(df, MARGIN=2, vector, `+`)
   x1 x2 x3 x4 x5
1   2  3  3  5  6
2   1  3  4  4  6
3   1 NA  3 NA NA
4  NA  3  4  5  6
5   1  3  4  4  6
6   2  2  3  4  6
7   2 NA NA NA NA
8  NA NA NA  4  6
9   1  2  3  4  5
10  2  3  4  5  6

Sum (vector + dataframe) in column-wise order:

vector <- 1:10  
sweep(df, MARGIN=1, vector, `+`)
   x1 x2 x3 x4 x5
1   2  2  1  2  2
2   2  3  3  2  3
3   3 NA  3 NA NA
4  NA  5  5  5  5
5   5  6  6  5  6
6   7  6  6  6  7
7   8 NA NA NA NA
8  NA NA NA  8  9
9   9  9  9  9  9
10 11 11 11 11 11

This the same to say vector+df

  • MARGIN = 1 is column-wise
  • MARGIN = 2 is row-wise.

And Yes. You can use sweep with:

sweep(df, MARGIN=2, vector, `-`)
sweep(df, MARGIN=2, vector, `*`)
sweep(df, MARGIN=2, vector, `/`)
sweep(df, MARGIN=2, vector, `^`)

Another Way is using Reduce with column-wise:

vector = 1:5
.df <- list(df, vector)
Reduce('+', .df)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM