Why is dplyr code that precomputes the maximum of a column slower than dplyr code that computes it inside the mutate call?
Sample data frame:
ngroups <- 100
nsamples <- 1000
foo <- data.frame(engine = rep(seq(1, ngroups), each = nsamples), cycles = runif(ngroups*nsamples, 0, nsamples))
I want to find the max of cycles for each engine group, and create a new variable tte = max(cycles) - cycles with mutate. I thought that if I precomputed the column of maximum cycles, instead of recomputing it inside the mutate command for each row, the code would be faster. It turns out I'm wrong:
library(microbenchmark)
library(dplyr)
library(magrittr)
add_tte <- function(dataset) {
  dataset %<>%
    group_by(engine) %>%
    mutate(max_cycles = max(cycles)) %>%
    mutate(tte = max_cycles - cycles) %>%
    select(-max_cycles) %>%
    ungroup
}
add_tte_old <- function(dataset) {
  dataset %<>%
    group_by(engine) %>%
    mutate(tte = max(cycles) - cycles) %>%
    ungroup
}
microbenchmark(add_tte(foo), add_tte_old(foo), times = 500)
# Unit: milliseconds
#              expr      min        lq     mean   median       uq       max neval
#      add_tte(foo) 17.45324 21.107264 26.50535 24.52625 28.75208 113.98433   500
#  add_tte_old(foo)  8.10376  9.949188 13.35830 12.18336 14.52474  77.64578   500
Why is this happening? Is the reason that dplyr computes the maximum just once per group, rather than once per row?
EDIT: even if I use a single mutate statement in add_tte and create a bigger example, add_tte_old is still faster:
# these are the only lines of code modified, the rest is as before
nsamples <- 10000
foo <- data.frame(engine = rep(seq(1, ngroups), each = nsamples), cycles = runif(ngroups*nsamples, 0, nsamples))
add_tte <- function(dataset) {
  dataset %<>%
    group_by(engine) %>%
    mutate(max_cycles = max(cycles), tte = max_cycles - cycles) %>%
    select(-max_cycles) %>%
    ungroup
}
# the new results are:
microbenchmark(add_tte(foo), add_tte_old(foo), times = 500)
# Unit: milliseconds
#              expr      min        lq      mean    median        uq      max neval
#      add_tte(foo) 90.46658 107.14015 139.13570 131.83689 158.24358 411.3272   500
#  add_tte_old(foo) 39.38357  46.13531  62.57386  52.00782  69.26815 176.1512   500
You have made some wrong assumptions, but more importantly, you are not comparing like with like.
It would make more sense to look at the two variants below:
add_tte <- function(dataset) {
  dataset %<>%
    group_by(engine) %>%
    mutate(max_cycles = rep(max(cycles), times = n()), tte = max_cycles - cycles) %>%
    select(-max_cycles) %>%
    ungroup
}
add_tte_old <- function(dataset) {
  dataset %<>%
    group_by(engine) %>%
    mutate(extra = rep(1, times = n()), tte = max(cycles) - cycles) %>%
    select(-extra) %>%
    ungroup
}
microbenchmark(add_tte(foo), add_tte_old(foo), times = 100)
On my machine, these two are pretty similar.
It is kind of ironic that with your way of attempting to pre-compute max(cycles), you probably did what you were trying to avoid :)
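One way to check how often the maximum is actually evaluated inside a grouped mutate is to wrap max() in a counting function. This is only a rough sketch: the toy data frame small and the helper counting_max are hypothetical names introduced for illustration, and the count may depend on the dplyr version installed.

```r
library(dplyr)

# Hypothetical toy data: 3 engines with 4 rows each
small <- data.frame(
  engine = rep(1:3, each = 4),
  cycles = c(5, 9, 2, 7, 1, 4, 8, 3, 6, 10, 12, 11)
)

calls <- 0  # incremented every time the wrapped max() runs

# counting_max is a hypothetical helper, not part of dplyr
counting_max <- function(x) {
  calls <<- calls + 1
  max(x)
}

result <- small %>%
  group_by(engine) %>%
  mutate(tte = counting_max(cycles) - cycles) %>%
  ungroup()

# Compare the number of evaluations with the number of groups and of rows
calls
n_distinct(small$engine)
nrow(small)
```

If calls comes out close to the number of groups rather than the number of rows, the expression is evaluated per group, not per row.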
In the case here, you should really use the explicit rep() to fill up the column, whereas in the subtraction max(cycles) - cycles the auto-recycling is alright.
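As a small sanity check, the sketch below compares the recycled column with the one filled explicitly through rep(). The toy data frame small is a hypothetical example, not the benchmark data, and the check says nothing about timing, only that both forms produce the same values.

```r
library(dplyr)

# Hypothetical toy data: 3 engines with 4 rows each
small <- data.frame(
  engine = rep(1:3, each = 4),
  cycles = c(5, 9, 2, 7, 1, 4, 8, 3, 6, 10, 12, 11)
)

# max(cycles) is a single value per group; mutate recycles it to the group size
recycled <- small %>%
  group_by(engine) %>%
  mutate(max_cycles = max(cycles)) %>%
  ungroup()

# Same column, but filled explicitly to the group size with rep()
explicit <- small %>%
  group_by(engine) %>%
  mutate(max_cycles = rep(max(cycles), times = n())) %>%
  ungroup()

# Both approaches should yield the same column
identical(recycled$max_cycles, explicit$max_cycles)
```

Since the values are the same, any difference between the two forms is purely about how the column gets filled, which is what the benchmark above is meant to isolate.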