Sum across multiple columns with dplyr

My question involves summing values across multiple columns of a data frame and creating a new column holding that sum, using dplyr. The data entries in the columns are binary (0, 1). I am thinking of a row-wise analog of the summarise_each or mutate_each function of dplyr. Below is a minimal example of the data frame:

library(dplyr)
df=data.frame(
  x1=c(1,0,0,NA,0,1,1,NA,0,1),
  x2=c(1,1,NA,1,1,0,NA,NA,0,1),
  x3=c(0,1,0,1,1,0,NA,NA,0,1),
  x4=c(1,0,NA,1,0,0,NA,0,0,1),
  x5=c(1,1,NA,1,1,1,NA,1,0,1))

> df
   x1 x2 x3 x4 x5
1   1  1  0  1  1
2   0  1  1  0  1
3   0 NA  0 NA NA
4  NA  1  1  1  1
5   0  1  1  0  1
6   1  0  0  0  1
7   1 NA NA NA NA
8  NA NA NA  0  1
9   0  0  0  0  0
10  1  1  1  1  1

I could use something like:

df <- df %>% mutate(sumrow= x1 + x2 + x3 + x4 + x5)

but this would involve writing out the names of each of the columns. I have about 50 columns. In addition, the column names change at different iterations of the loop in which I want to implement this operation, so I would like to avoid having to give any column names.

How can I do that most efficiently? Any assistance would be greatly appreciated.

dplyr >= 1.0.0 using across

sum up each row using rowSums (rowwise works for any aggregation, but is slower)

df %>%
   replace(is.na(.), 0) %>%
   mutate(sum = rowSums(across(where(is.numeric))))

sum down each column

df %>%
   summarise(across(everything(), ~ sum(., na.rm = TRUE)))
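
As a cross-check for column totals that ignore NA, here is a minimal sketch assuming the df from the question. It compares a dplyr summarise with base R's colSums; the names col_totals and base_totals are only illustrative.

```r
library(dplyr)

df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# Column totals via dplyr, dropping NA inside sum()
col_totals <- df %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))

# Base R equivalent, returned as a named numeric vector
base_totals <- colSums(df, na.rm = TRUE)
```

Both approaches should give the same totals; colSums returns a named vector, while summarise returns a one-row data frame.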

dplyr < 1.0.0

sum up each row

df %>%
   replace(is.na(.), 0) %>%
   mutate(sum = rowSums(.[1:5]))

sum down each column using superseded summarise_all:

df %>%
   replace(is.na(.), 0) %>%
   summarise_all(sum)  # funs() is deprecated since dplyr 0.8.0; pass the function directly

If you want to sum certain columns only, I'd use something like this:

library(dplyr)
df=data.frame(
  x1=c(1,0,0,NA,0,1,1,NA,0,1),
  x2=c(1,1,NA,1,1,0,NA,NA,0,1),
  x3=c(0,1,0,1,1,0,NA,NA,0,1),
  x4=c(1,0,NA,1,0,0,NA,0,0,1),
  x5=c(1,1,NA,1,1,1,NA,1,0,1))
df %>% select(x3:x5) %>% rowSums(na.rm=TRUE) -> df$x3x5.total
head(df)

This way you can use dplyr::select's syntax.

I would use regular expression matching to sum over variables whose names follow a certain pattern. For example:

df <- df %>% mutate(sum1 = rowSums(.[grep("x[3-5]", names(.))], na.rm = TRUE),
                    sum_all = rowSums(.[grep("x", names(.))], na.rm = TRUE))

This way you can create more than one variable, each as the sum of a certain group of variables in your data frame.

Using reduce() from purrr is slightly faster than rowSums and definitely faster than apply, since you avoid iterating over all the rows and just take advantage of the vectorized operations:
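
The same pattern-based selection can also be sketched with tidyselect helpers inside across (this assumes dplyr >= 1.0.0 and the df from the question). The anchored pattern "^x[3-5]$" is a choice made here so that only x3, x4 and x5 match.

```r
library(dplyr)

df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

df_sums <- df %>%
  mutate(
    # matches() takes a regular expression, like grep() on names(.)
    sum1 = rowSums(across(matches("^x[3-5]$")), na.rm = TRUE),
    # starts_with("x") does not pick up sum1, which was created just above
    sum_all = rowSums(across(starts_with("x")), na.rm = TRUE)
  )
```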

library(purrr)
library(dplyr)
iris %>% mutate(Petal = reduce(select(., starts_with("Petal")), `+`))

See this for timings

dplyr >= 1.0.0

In newer versions of dplyr you can use rowwise() along with c_across to perform row-wise aggregation for functions that do not have specific row-wise variants, but if a row-wise variant exists it should be faster.

Since rowwise() is just a special form of grouping and changes the way verbs work, you'll likely want to pipe it to ungroup() after doing your row-wise operation.
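
One caveat with reducing by `+`: any NA in a row makes that row's total NA, whereas rowSums has na.rm. The sketch below, assuming the df from the question, treats NA as 0 inside the reduction. The accumulator function and the names plain and df_total are illustrative, not part of the original answer.

```r
library(dplyr)
library(purrr)

df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# Plain reduction: rows containing any NA end up as NA
plain <- reduce(select(df, x1:x5), `+`)

# NA-tolerant reduction: replace NA by 0 in each column before adding it
df_total <- df %>%
  mutate(total = reduce(
    across(x1:x5),
    function(acc, col) acc + coalesce(col, 0),
    .init = 0
  ))
```

The operation stays vectorized over columns, so it should keep most of the speed advantage, though this was not benchmarked here.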
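
To illustrate why the ungroup() matters, here is a small sketch assuming the df from the question. While the data is still row-wise, summarise() returns one row per input row; after ungroup() it collapses to a single row. The variable names are illustrative.

```r
library(dplyr)

df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

by_row <- df %>%
  rowwise() %>%
  mutate(total = sum(c_across(x1:x5), na.rm = TRUE))

# The result still carries the row-wise grouping
still_rowwise <- inherits(by_row, "rowwise_df")

# Row-wise summarise: one output row per input row
per_row <- by_row %>% summarise(s = sum(total))

# After ungroup(), summarise collapses all rows into one
ungrouped <- ungroup(by_row)
grand_total <- ungrouped %>% summarise(grand_total = sum(total))
```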

To select a range by name:

df %>%
  rowwise() %>% 
  mutate(sumrange = sum(c_across(x1:x5), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()

To select by type:

df %>%
  rowwise() %>% 
  mutate(sumnumeric = sum(c_across(where(is.numeric)), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()

To select by column name:

You can use any number of tidy selection helpers like starts_with, ends_with, contains, etc.

df %>%
    rowwise() %>% 
    mutate(sum_startswithx = sum(c_across(starts_with("x")), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()

To select by column index:

df %>% 
  rowwise() %>% 
  mutate(sumindex = sum(c_across(c(1:4, 5)), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()

rowwise() will work for any summary function. However, in your specific case a row-wise variant exists (rowSums), so you can do the following (note the use of across instead), which will be faster:

df %>%
  mutate(sumrow = rowSums(across(x1:x5), na.rm = T))

For more information see the page on rowwise.
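
Since the question mentions column names that change between loop iterations, one possible sketch is to pass the names as a character vector and select them with all_of(). The helper add_row_sum, its arguments and the col_sets list are hypothetical names introduced here, not part of dplyr.

```r
library(dplyr)

df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# Hypothetical helper: sum the columns named in `cols` (a character vector)
# and store the result in a new column called `name`
add_row_sum <- function(data, cols, name = "sumrow") {
  data %>%
    mutate("{name}" := rowSums(across(all_of(cols)), na.rm = TRUE))
}

# Column sets that could differ from one iteration to the next
col_sets <- list(first = c("x1", "x2"), last = c("x4", "x5"))
results <- lapply(col_sets, function(cols) add_row_sum(df, cols))

# No names typed at all: use every column currently in the data frame
all_cols <- add_row_sum(df, names(df))
```

all_of() raises an error if a requested column is missing, which seems preferable to silently summing fewer columns.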


Benchmarking

For this example, the row-wise variant rowSums takes about half as much time:

library(microbenchmark)

microbenchmark(
  df %>%
    dplyr::rowwise() %>% 
    dplyr::mutate(sumrange = sum(dplyr::c_across(x1:x5), na.rm = T)),
  df %>%
    dplyr::mutate(sumrow = rowSums(dplyr::across(x1:x5), na.rm = T)),
  times = 1000L
)

    min    lq     mean  median      uq     max neval cld
 5.5256 6.256 7.024232 6.58885 7.02325 22.1911  1000   b
 2.7011 3.112 3.661106 3.41070 3.71975 32.6282  1000  a 

c_across versus across

In the particular case of the sum function, across and c_across give the same output for much of the code above:

sum_across <- df %>%
    rowwise() %>% 
    mutate(sumrange = sum(across(x1:x5), na.rm = T))

sum_c_across <- df %>%
    rowwise() %>% 
    mutate(sumrange = sum(c_across(x1:x5), na.rm = T))

all.equal(sum_across, sum_c_across)
[1] TRUE

The row-wise output of c_across is a vector (hence the c_), while the row-wise output of across is a 1-row tibble object:

df %>% 
  rowwise() %>% 
  mutate(c_across = list(c_across(x1:x5)),
         across = list(across(x1:x5)),
         .keep = "unused") %>% 
  ungroup() 

# A tibble: 10 x 2
   c_across  across          
   <list>    <list>          
 1 <dbl [5]> <tibble [1 x 5]>
 2 <dbl [5]> <tibble [1 x 5]>
 3 <dbl [5]> <tibble [1 x 5]>
 4 <dbl [5]> <tibble [1 x 5]>
 5 <dbl [5]> <tibble [1 x 5]>
 6 <dbl [5]> <tibble [1 x 5]>
 7 <dbl [5]> <tibble [1 x 5]>
 8 <dbl [5]> <tibble [1 x 5]>
 9 <dbl [5]> <tibble [1 x 5]>
10 <dbl [5]> <tibble [1 x 5]>

The function you want to apply determines which verb you use. As shown above with sum, you can use them nearly interchangeably. However, mean and many other common functions expect a (numeric) vector as their first argument:

class(df[1,])
"data.frame"

sum(df[1,]) # works with data.frame
[1] 4

mean(df[1,]) # does not work with data.frame
[1] NA
Warning message:
In mean.default(df[1, ]) : argument is not numeric or logical: returning NA

class(unname(unlist(df[1,])))
"numeric"

sum(unname(unlist(df[1,]))) # works with numeric vector
[1] 4

mean(unname(unlist(df[1,]))) # works with numeric vector
[1] 0.8

Ignoring the row-wise variant that exists for mean (rowMeans), c_across should be used in this case:

df %>% 
  rowwise() %>% 
  mutate(avg = mean(c_across(x1:x5), na.rm = T)) %>% 
  ungroup()

# A tibble: 10 x 6
      x1    x2    x3    x4    x5   avg
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1     1     0     1     1   0.8
 2     0     1     1     0     1   0.6
 3     0    NA     0    NA    NA   0  
 4    NA     1     1     1     1   1  
 5     0     1     1     0     1   0.6
 6     1     0     0     0     1   0.4
 7     1    NA    NA    NA    NA   1  
 8    NA    NA    NA     0     1   0.5
 9     0     0     0     0     0   0  
10     1     1     1     1     1   1  

# Does not work
df %>% 
  rowwise() %>% 
  mutate(avg = mean(across(x1:x5), na.rm = T)) %>% 
  ungroup()

rowSums, rowMeans, etc. can take a numeric data frame as the first argument, which is why they work with across.

I encounter this problem often, and the easiest way to do this is to use the apply() function within a mutate command.
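
As a sketch of that last point, and assuming the df from the question, rowMeans combined with across should reproduce the avg column computed with rowwise() and c_across above, without the row-wise grouping. The name df_avg is illustrative.

```r
library(dplyr)

df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# across() hands rowMeans a data frame of the selected columns;
# na.rm = TRUE averages only the non-missing values in each row
df_avg <- df %>%
  mutate(avg = rowMeans(across(x1:x5), na.rm = TRUE))
```

Note that a row consisting only of NA would yield NaN here; the example data has no such row.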

library(tidyverse)
df=data.frame(
  x1=c(1,0,0,NA,0,1,1,NA,0,1),
  x2=c(1,1,NA,1,1,0,NA,NA,0,1),
  x3=c(0,1,0,1,1,0,NA,NA,0,1),
  x4=c(1,0,NA,1,0,0,NA,0,0,1),
  x5=c(1,1,NA,1,1,1,NA,1,0,1))

df %>%
  mutate(sum = select(., x1:x5) %>% apply(1, sum, na.rm=TRUE))

Here you can select the columns however you like using the standard dplyr tricks (e.g. starts_with() or contains()). By doing all the work within a single mutate command, this action can occur anywhere within a dplyr stream of processing steps. Finally, by using the apply() function, you have the flexibility to use whatever summary you need, including your own purpose-built summarization function.

Alternatively, if the idea of using a non-tidyverse function is unappealing, then you could gather up the columns, summarize them and finally join the result back to the original data frame.

df <- df %>% mutate( id = 1:n() )   # Need some ID column for this to work

df <- df %>%
  group_by(id) %>%
  gather('Key', 'value', starts_with('x')) %>%
  summarise( Key.Sum = sum(value) ) %>%
  left_join( df, . )

Here I used the starts_with() function to select the columns and calculated the sum; you can handle NA values however you want. The downside to this approach is that while it is pretty flexible, it doesn't really fit into a dplyr stream of data cleaning steps.
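
gather() has since been superseded in tidyr by pivot_longer(). A minimal sketch of the same reshape-summarise-join idea, assuming tidyr >= 1.0.0 and the df from the question, could look like this. The names df_id, row_sums, key_sum and result are illustrative, and NA is dropped here with na.rm = TRUE as one possible choice.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# An ID column is still needed to join the sums back
df_id <- df %>% mutate(id = row_number())

row_sums <- df_id %>%
  pivot_longer(starts_with("x"), names_to = "key", values_to = "value") %>%
  group_by(id) %>%
  summarise(key_sum = sum(value, na.rm = TRUE))

# Join explicitly by id so the result keeps the original columns plus key_sum
result <- left_join(df_id, row_sums, by = "id")
```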

Benchmarking (almost) all options to sum across columns

As it's difficult to decide among all the interesting answers given by @skd, @LMc, and others, I benchmarked all alternatives which are reasonably long.

The difference from the other examples is that I used a larger dataset (10,000 rows) sampled from a real-world dataset (diamonds), so the findings might better reflect the variance of real-world data.

The reproducible benchmarking code is:

library(tidyverse)  # loads dplyr, tidyr, purrr, and ggplot2 (which provides diamonds)

set.seed(17)
dataset <- diamonds %>% sample_n(1e4)
cols <- c("depth", "table", "x", "y", "z")

sum.explicit <- function() {
  dataset %>%
    mutate(sum.cols = depth + table + x + y + z)
}

sum.rowSums <- function() {
  dataset %>%
    mutate(sum.cols = rowSums(across(cols)))
}

sum.reduce <- function() {
  dataset %>%
    mutate(sum.cols = purrr::reduce(select(., cols), `+`))
}

sum.nest <- function() {
  dataset %>%
  group_by(id = row_number()) %>%
  nest(data = cols) %>%
  mutate(sum.cols = map_dbl(data, sum))
}

# NOTE: across with rowwise doesn't work with all functions!
sum.across <- function() {
  dataset %>%
    rowwise() %>%
    mutate(sum.cols = sum(across(cols)))
}

sum.c_across <- function() {
  dataset %>%
  rowwise() %>%
  mutate(sum.cols = sum(c_across(cols)))
}

sum.apply <- function() {
  dataset %>%
    mutate(sum.cols = select(., cols) %>%
             apply(1, sum, na.rm = TRUE))
}

bench <- microbenchmark::microbenchmark(
  sum.nest(),
  sum.across(),
  sum.c_across(),
  sum.apply(),
  sum.explicit(),
  sum.reduce(),
  sum.rowSums(),
  times = 10
)

bench %>% print(order = 'mean', signif = 3)
Unit: microseconds
           expr     min      lq    mean  median      uq     max neval
 sum.explicit()     796     839    1160     950    1040    3160    10
  sum.rowSums()    1430    1450    1770    1650    1800    2980    10
   sum.reduce()    1650    1700    2090    2000    2140    3300    10
    sum.apply()    9290    9400    9720    9620    9840   11000    10
 sum.c_across()  341000  348000  353000  356000  359000  360000    10
     sum.nest()  793000  827000  854000  843000  871000  945000    10
   sum.across() 4810000 4830000 4880000 4900000 4920000 4940000    10

Visualizing this (without the outlier sum.across) facilitates the comparison:

[image: plot of the benchmark timings, excluding sum.across]

Conclusion (subjective!)

  1. Despite great readability, nest and rowwise / c_across are not recommendable for larger datasets (> 100,000 rows or repeated actions).
  2. The explicit sum wins because it makes the best internal use of vectorization, which rowSums also exploits, though with a little computational overhead.
  3. purrr::reduce is relatively new in the tidyverse (but well known in Python) and, like Reduce in base R, very efficient, thus earning a place in the top 3. Because the explicit form is cumbersome to write, and there are not many vectorized methods other than rowSums / rowMeans and colSums / colMeans, I would recommend applying purrr::reduce for all other functions (e.g. sd).

In case you want to add a vector across the columns or rows, modifying the df itself instead of adding a new column to it:

You can use the sweep function:

library(dplyr)
df=data.frame(
  x1=c(1,0,0,NA,0,1,1,NA,0,1),
  x2=c(1,1,NA,1,1,0,NA,NA,0,1),
  x3=c(0,1,0,1,1,0,NA,NA,0,1),
  x4=c(1,0,NA,1,0,0,NA,0,0,1),
  x5=c(1,1,NA,1,1,1,NA,1,0,1))
> df
   x1 x2 x3 x4 x5
1   1  1  0  1  1
2   0  1  1  0  1
3   0 NA  0 NA NA
4  NA  1  1  1  1
5   0  1  1  0  1
6   1  0  0  0  1
7   1 NA NA NA NA
8  NA NA NA  0  1
9   0  0  0  0  0
10  1  1  1  1  1

Sum (vector + dataframe) in row-wise order:

vector = 1:5
sweep(df, MARGIN=2, vector, `+`)
   x1 x2 x3 x4 x5
1   2  3  3  5  6
2   1  3  4  4  6
3   1 NA  3 NA NA
4  NA  3  4  5  6
5   1  3  4  4  6
6   2  2  3  4  6
7   2 NA NA NA NA
8  NA NA NA  4  6
9   1  2  3  4  5
10  2  3  4  5  6

Sum (vector + dataframe) in column-wise order:

vector <- 1:10  
sweep(df, MARGIN=1, vector, `+`)
   x1 x2 x3 x4 x5
1   2  2  1  2  2
2   2  3  3  2  3
3   3 NA  3 NA NA
4  NA  5  5  5  5
5   5  6  6  5  6
6   7  6  6  6  7
7   8 NA NA NA NA
8  NA NA NA  8  9
9   9  9  9  9  9
10 11 11 11 11 11

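This is the same as vector + df.

  • MARGIN = 1 is column-wise: element i of the vector is added to every entry of row i, so the vector runs down each column.
  • MARGIN = 2 is row-wise: element j of the vector is added to every entry of column j, so the vector runs along each row.

And yes, you can also use sweep with other operators: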
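
The two MARGIN settings can be checked on a tiny all-zero matrix, where the result shows directly which vector element landed where. This is a base R sketch; the matrix m and the result names are illustrative.

```r
# 3 x 2 matrix of zeros, so the output equals whatever sweep() added
m <- matrix(0, nrow = 3, ncol = 2)

# MARGIN = 2: one vector element per column (vector length = ncol)
by_col <- sweep(m, MARGIN = 2, STATS = c(10, 20), FUN = `+`)

# MARGIN = 1: one vector element per row (vector length = nrow)
by_row <- sweep(m, MARGIN = 1, STATS = 1:3, FUN = `+`)
```

Here by_col has 10 throughout its first column and 20 throughout its second, while each column of by_row reads 1, 2, 3.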

sweep(df, MARGIN=2, vector, `-`)
sweep(df, MARGIN=2, vector, `*`)
sweep(df, MARGIN=2, vector, `/`)
sweep(df, MARGIN=2, vector, `^`)

Another way is to use Reduce, which works column-wise:

vector = 1:5
.df <- list(df, vector)
Reduce('+', .df)
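
To see what this Reduce call does, note that it amounts to df + vector, where the vector is recycled down each column in turn. A small base R sketch on an all-zero data frame makes the recycling visible; the names small and res are illustrative.

```r
# 4 x 2 data frame of zeros, so the result shows the recycled vector itself
small <- data.frame(a = rep(0, 4), b = rep(0, 4))

# Reduce over a two-element list is just small + 1:2
res <- Reduce(`+`, list(small, 1:2))

# Direct arithmetic gives the same data frame
direct <- small + 1:2
```

Each column of res reads 1, 2, 1, 2, so a vector shorter than the number of rows is repeated rather than matched to columns.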
