Sum across multiple columns with dplyr
My question involves summing values across multiple columns of a data frame and creating a new column that holds this sum, using dplyr. The data entries in the columns are binary (0, 1). I am thinking of a row-wise analog of the summarise_each or mutate_each function of dplyr. Below is a minimal example of the data frame:
library(dplyr)
df=data.frame(
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
> df
x1 x2 x3 x4 x5
1 1 1 0 1 1
2 0 1 1 0 1
3 0 NA 0 NA NA
4 NA 1 1 1 1
5 0 1 1 0 1
6 1 0 0 0 1
7 1 NA NA NA NA
8 NA NA NA 0 1
9 0 0 0 0 0
10 1 1 1 1 1
I could use something like:
df <- df %>% mutate(sumrow= x1 + x2 + x3 + x4 + x5)
but this would involve writing out the names of each of the columns. I have about 50 columns. In addition, the column names change at different iterations of the loop in which I want to implement this operation, so I would like to avoid having to give any column names.
How can I do that most efficiently? Any assistance would be greatly appreciated.
sum up each row using rowSums (rowwise works for any aggregation, but is slower)
df %>%
replace(is.na(.), 0) %>%
mutate(sum = rowSums(across(where(is.numeric))))
sum down each column
df %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
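As a minimal sketch of NA-tolerant column totals, the following compares a dplyr version with base R's colSums(). It assumes the example df from the question; the names col_totals and col_totals_base are only illustrative.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# dplyr: a one-row data frame of column totals, ignoring NA
col_totals <- df %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))

# base R: a named numeric vector holding the same totals
col_totals_base <- colSums(df, na.rm = TRUE)
```

The dplyr form keeps the result as a data frame, which is convenient inside a pipeline; colSums() returns a plain named vector.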
sum up each row
df %>%
replace(is.na(.), 0) %>%
mutate(sum = rowSums(.[1:5]))
sum down each column using the superseded summarise_all:
df %>%
replace(is.na(.), 0) %>%
summarise_all(sum) # funs() is deprecated; pass the function directly
If you want to sum certain columns only, I'd use something like this:
library(dplyr)
df=data.frame(
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
df %>% select(x3:x5) %>% rowSums(na.rm=TRUE) -> df$x3x5.total
head(df)
This way you can use dplyr::select's syntax.
I would use regular expression matching to sum over variables whose names follow a certain pattern. For example:
df <- df %>% mutate(sum1 = rowSums(.[grep("x[3-5]", names(.))], na.rm = TRUE),
sum_all = rowSums(.[grep("x", names(.))], na.rm = TRUE))
This way you can create more than one variable, each as the sum of a certain group of variables in your data frame.
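The same pattern-based selection can also be sketched with dplyr's own selection helpers instead of grep(). This is only an alternative spelling, assuming dplyr 1.0 or later and the example df from the question; matches() takes a regular expression and starts_with() a literal prefix.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

result <- df %>%
  mutate(
    # columns whose names match the regular expression x3, x4, x5
    sum1 = rowSums(across(matches("^x[3-5]$")), na.rm = TRUE),
    # every column whose name starts with "x" (sum1 is not matched)
    sum_all = rowSums(across(starts_with("x")), na.rm = TRUE)
  )
```

Anchoring the pattern with ^ and $ avoids accidentally matching longer names such as x30 when there are many columns.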
Using reduce() from purrr is slightly faster than rowSums and definitely faster than apply, since you avoid iterating over all the rows and just take advantage of the vectorized operations:
library(purrr)
library(dplyr)
iris %>% mutate(Petal = reduce(select(., starts_with("Petal")), `+`))
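The iris columns contain no missing values, but the question's data does, and `+` propagates NA. The sketch below shows one possible way to handle that with reduce(), assuming the example df from the question: NAs are replaced by 0 with coalesce() before adding, and a second column uses pmax() to illustrate another two-argument operation. The column names total and row_max are illustrative.

```r
library(dplyr)
library(purrr)

# Example data from the question
df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

result <- df %>%
  mutate(
    # treat NA as 0, then add the columns pairwise
    total = reduce(across(x1:x5, ~ coalesce(.x, 0)), `+`),
    # pairwise maximum across the columns, skipping NA
    row_max = reduce(across(x1:x5), function(a, b) pmax(a, b, na.rm = TRUE))
  )
```

Note that reduce() combines the columns two at a time, so it suits two-argument operations such as `+` or pmax(); a summary that needs the whole row at once may be easier to express with c_across or apply.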
In newer versions of dplyr you can use rowwise() along with c_across to perform row-wise aggregation for functions that do not have specific row-wise variants, but if a row-wise variant exists it should be faster.
Since rowwise() is just a special form of grouping and changes the way verbs work, you'll likely want to pipe it to ungroup() after doing your row-wise operation.
To select a range by name:
df %>%
rowwise() %>%
mutate(sumrange = sum(c_across(x1:x5), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()
To select by type:
df %>%
rowwise() %>%
mutate(sumnumeric = sum(c_across(where(is.numeric)), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()
To select by column name:
You can use any number of tidy selection helpers like starts_with, ends_with, contains, etc.
df %>%
rowwise() %>%
mutate(sum_startswithx = sum(c_across(starts_with("x")), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()
To select by column index:
df %>%
rowwise() %>%
mutate(sumindex = sum(c_across(c(1:4, 5)), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()
rowwise() will work for any summary function. However, in your specific case a row-wise variant exists (rowSums), so you can do the following (note the use of across instead), which will be faster:
df %>%
mutate(sumrow = rowSums(across(x1:x5), na.rm = T))
For more information see the page on rowwise.
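Since the question mentions that the column names change between loop iterations, here is a minimal sketch of wrapping the rowSums approach in a function that takes the column names as a character vector. The helper add_row_sum and its arguments are hypothetical names introduced for illustration; all_of() and the "{name}" := syntax are assumed to be available (dplyr 1.0 or later).

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# Hypothetical helper: `cols` is a character vector of column names,
# `name` is the name of the new column holding the row sums.
add_row_sum <- function(data, cols, name = "sumrow") {
  data %>%
    mutate("{name}" := rowSums(across(all_of(cols)), na.rm = TRUE))
}

# The names can be computed at run time, e.g. inside a loop
cols_this_iteration <- c("x1", "x2", "x3")
result <- add_row_sum(df, cols_this_iteration, name = "sum_x1_x3")

# Summing every column without typing any name
result_all <- add_row_sum(df, names(df))
```

Passing the names through all_of() makes it explicit that they come from a variable rather than from columns of the data frame.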
Benchmarking
For this example, the row-wise variant rowSums takes about half as much time:
library(microbenchmark)
microbenchmark(
df %>%
dplyr::rowwise() %>%
dplyr::mutate(sumrange = sum(dplyr::c_across(x1:x5), na.rm = T)),
df %>%
dplyr::mutate(sumrow = rowSums(dplyr::across(x1:x5), na.rm = T)),
times = 1000L
)
min lq mean median uq max neval cld
5.5256 6.256 7.024232 6.58885 7.02325 22.1911 1000 b
2.7011 3.112 3.661106 3.41070 3.71975 32.6282 1000 a
c_across versus across
In the particular case of the sum function, across and c_across give the same output for much of the code above:
sum_across <- df %>%
rowwise() %>%
mutate(sumrange = sum(across(x1:x5), na.rm = T))
sum_c_across <- df %>%
rowwise() %>%
mutate(sumrange = sum(c_across(x1:x5), na.rm = T))
all.equal(sum_across, sum_c_across)
[1] TRUE
The row-wise output of c_across is a vector (hence the c_), while the row-wise output of across is a 1-row tibble object:
df %>%
rowwise() %>%
mutate(c_across = list(c_across(x1:x5)),
across = list(across(x1:x5)),
.keep = "unused") %>%
ungroup()
# A tibble: 10 x 2
c_across across
<list> <list>
1 <dbl [5]> <tibble [1 x 5]>
2 <dbl [5]> <tibble [1 x 5]>
3 <dbl [5]> <tibble [1 x 5]>
4 <dbl [5]> <tibble [1 x 5]>
5 <dbl [5]> <tibble [1 x 5]>
6 <dbl [5]> <tibble [1 x 5]>
7 <dbl [5]> <tibble [1 x 5]>
8 <dbl [5]> <tibble [1 x 5]>
9 <dbl [5]> <tibble [1 x 5]>
10 <dbl [5]> <tibble [1 x 5]>
The function you want to apply determines which verb you use. As shown above with sum, you can use them nearly interchangeably. However, mean and many other common functions expect a (numeric) vector as their first argument:
class(df[1,])
"data.frame"
sum(df[1,]) # works with data.frame
[1] 4
mean(df[1,]) # does not work with data.frame
[1] NA
Warning message:
In mean.default(df[1, ]) : argument is not numeric or logical: returning NA
class(unname(unlist(df[1,])))
"numeric"
sum(unname(unlist(df[1,]))) # works with numeric vector
[1] 4
mean(unname(unlist(df[1,]))) # works with numeric vector
[1] 0.8
Ignoring the row-wise variant that exists for mean (rowMeans), in this case c_across should be used:
df %>%
rowwise() %>%
mutate(avg = mean(c_across(x1:x5), na.rm = T)) %>%
ungroup()
# A tibble: 10 x 6
x1 x2 x3 x4 x5 avg
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 1 1 0.8
2 0 1 1 0 1 0.6
3 0 NA 0 NA NA 0
4 NA 1 1 1 1 1
5 0 1 1 0 1 0.6
6 1 0 0 0 1 0.4
7 1 NA NA NA NA 1
8 NA NA NA 0 1 0.5
9 0 0 0 0 0 0
10 1 1 1 1 1 1
# Does not work
df %>%
rowwise() %>%
mutate(avg = mean(across(x1:x5), na.rm = T)) %>%
ungroup()
rowSums, rowMeans, etc. can take a numeric data frame as the first argument, which is why they work with across.
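As a sketch of that last point, the averages from the c_across example can also be computed with rowMeans and across, without rowwise(). This assumes the example df from the question.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# rowMeans accepts the data frame returned by across() directly,
# so no rowwise() / ungroup() pair is needed
result <- df %>%
  mutate(avg = rowMeans(across(x1:x5), na.rm = TRUE))
```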
I encounter this problem often, and the easiest way to do this is to use the apply() function within a mutate command.
library(tidyverse)
df=data.frame(
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
df %>%
mutate(sum = select(., x1:x5) %>% apply(1, sum, na.rm=TRUE))
Here you could select the columns using any of the standard dplyr tricks (e.g. starts_with() or contains()). By doing all the work within a single mutate command, this action can occur anywhere within a dplyr stream of processing steps. Finally, by using the apply() function, you have the flexibility to use whatever summary you need, including your own purpose-built summarization function.
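To illustrate the point about purpose-built summaries, here is a small sketch that counts the missing entries in each row. The function count_missing is a hypothetical example, and across() is used here in place of select(., ...) only to keep the sketch free of the dot placeholder; the example df from the question is assumed.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# Hypothetical custom summary: number of NA entries in a vector
count_missing <- function(v) sum(is.na(v))

# apply(…, 1, f) calls f once per row of the selected columns
result <- df %>%
  mutate(n_missing = apply(across(x1:x5), 1, count_missing))
```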
Alternatively, if the idea of using a non-tidyverse function is unappealing, then you could gather up the columns, summarize them and finally join the result back to the original data frame.
df <- df %>% mutate( id = 1:n() ) # Need some ID column for this to work
df <- df %>%
group_by(id) %>%
gather('Key', 'value', starts_with('x')) %>%
summarise( Key.Sum = sum(value) ) %>%
left_join( df, . )
Here I used the starts_with() function to select the columns and calculated the sum; you can do whatever you want with NA values. The downside to this approach is that, while it is pretty flexible, it doesn't really fit into a dplyr stream of data cleaning steps.
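In current tidyr, gather() is superseded by pivot_longer(). The same reshape, summarise, and join idea could be sketched as follows; this is one possible rewrite rather than the original author's code, it assumes the example df from the question, and it drops NA values in the sum as one choice among several.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# An ID column is needed to match the sums back to the rows
df_id <- df %>% mutate(id = row_number())

# Long format: one row per (id, column) pair, then one sum per id
sums <- df_id %>%
  pivot_longer(cols = starts_with("x"), names_to = "key", values_to = "value") %>%
  group_by(id) %>%
  summarise(key_sum = sum(value, na.rm = TRUE), .groups = "drop")

# Join the per-row sums back onto the original columns
result <- left_join(df_id, sums, by = "id")
```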
As it's difficult to decide among all the interesting answers given by @skd, @LMc, and others, I benchmarked all alternatives which are reasonably long.
The difference from the other examples is that I used a larger dataset (10,000 rows) sampled from a real-world dataset (diamonds), so the findings might better reflect the variance of real-world data.
The reproducible benchmarking code is:
library(tidyverse) # dplyr, tidyr, purrr, and the diamonds data (from ggplot2)
set.seed(17)
dataset <- diamonds %>% sample_n(1e4)
cols <- c("depth", "table", "x", "y", "z")
sum.explicit <- function() {
dataset %>%
mutate(sum.cols = depth + table + x + y + z)
}
sum.rowSums <- function() {
dataset %>%
mutate(sum.cols = rowSums(across(cols)))
}
sum.reduce <- function() {
dataset %>%
mutate(sum.cols = purrr::reduce(select(., cols), `+`))
}
sum.nest <- function() {
dataset %>%
group_by(id = row_number()) %>%
nest(data = cols) %>%
mutate(sum.cols = map_dbl(data, sum))
}
# NOTE: across with rowwise doesn't work with all functions!
sum.across <- function() {
dataset %>%
rowwise() %>%
mutate(sum.cols = sum(across(cols)))
}
sum.c_across <- function() {
dataset %>%
rowwise() %>%
mutate(sum.cols = sum(c_across(cols)))
}
sum.apply <- function() {
dataset %>%
mutate(sum.cols = select(., cols) %>%
apply(1, sum, na.rm = TRUE))
}
bench <- microbenchmark::microbenchmark(
sum.nest(),
sum.across(),
sum.c_across(),
sum.apply(),
sum.explicit(),
sum.reduce(),
sum.rowSums(),
times = 10
)
bench %>% print(order = 'mean', signif = 3)
Unit: microseconds
expr min lq mean median uq max neval
sum.explicit() 796 839 1160 950 1040 3160 10
sum.rowSums() 1430 1450 1770 1650 1800 2980 10
sum.reduce() 1650 1700 2090 2000 2140 3300 10
sum.apply() 9290 9400 9720 9620 9840 11000 10
sum.c_across() 341000 348000 353000 356000 359000 360000 10
sum.nest() 793000 827000 854000 843000 871000 945000 10
sum.across() 4810000 4830000 4880000 4900000 4920000 4940000 10
Visualizing this (without the outlier sum.across) facilitates the comparison:
nest and rowwise / c_across are not recommendable for larger datasets (> 100,000 rows or repeated actions).
rowSums also takes advantage of vectorized operations, but with a little computational overhead.
purrr::reduce is relatively new in the tidyverse (but well known in Python) and, like Reduce in base R, very efficient, thus winning a place among the top 3.
Because the explicit form is cumbersome to write, and there are not many vectorized methods other than rowSums / rowMeans and colSums / colMeans, I would recommend applying purrr::reduce for all other functions (e.g. sd).
In case you want to add a vector across the columns or rows, modifying the df itself instead of adding a new column to it, you can use the sweep function:
library(dplyr)
df=data.frame(
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
> df
x1 x2 x3 x4 x5
1 1 1 0 1 1
2 0 1 1 0 1
3 0 NA 0 NA NA
4 NA 1 1 1 1
5 0 1 1 0 1
6 1 0 0 0 1
7 1 NA NA NA NA
8 NA NA NA 0 1
9 0 0 0 0 0
10 1 1 1 1 1
Sum (vector + dataframe) in row-wise order:
vector = 1:5
sweep(df, MARGIN=2, vector, `+`)
x1 x2 x3 x4 x5
1 2 3 3 5 6
2 1 3 4 4 6
3 1 NA 3 NA NA
4 NA 3 4 5 6
5 1 3 4 4 6
6 2 2 3 4 6
7 2 NA NA NA NA
8 NA NA NA 4 6
9 1 2 3 4 5
10 2 3 4 5 6
Sum (vector + dataframe) in column-wise order:
vector <- 1:10
sweep(df, MARGIN=1, vector, `+`)
x1 x2 x3 x4 x5
1 2 2 1 2 2
2 2 3 3 2 3
3 3 NA 3 NA NA
4 NA 5 5 5 5
5 5 6 6 5 6
6 7 6 6 6 7
7 8 NA NA NA NA
8 NA NA NA 8 9
9 9 9 9 9 9
10 11 11 11 11 11
This is the same as writing vector + df.
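A quick way to check that claim for the MARGIN=1 case is sketched below, assuming the example df from the question. The names row_offsets, swept, added, and same are illustrative; the comparison goes through as.matrix() so that only the values are compared.

```r
# Example data from the question
df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

row_offsets <- 1:10  # one value per row

# sweep with MARGIN = 1 adds row_offsets[i] to every entry of row i
swept <- sweep(df, MARGIN = 1, row_offsets, `+`)

# plain arithmetic recycles the vector down each column
added <- row_offsets + df

# compare values only, ignoring row and column names
same <- isTRUE(all.equal(unname(as.matrix(swept)), unname(as.matrix(added))))
```

The equivalence relies on the vector having one element per row. A vector with one element per column (as in the MARGIN=2 example) would be recycled down the columns by plain `+`, so sweep is the safer choice there.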
And yes, you can also use sweep with other operators:
sweep(df, MARGIN=2, vector, `-`)
sweep(df, MARGIN=2, vector, `*`)
sweep(df, MARGIN=2, vector, `/`)
sweep(df, MARGIN=2, vector, `^`)
Another way is to use Reduce, column-wise:
vector = 1:5
.df <- list(df, vector)
Reduce('+', .df)