
Why is dplyr code that precomputes the maximum of a column slower than dplyr code that computes it inside the mutate call?

Sample data frame:

ngroups <- 100
nsamples <- 1000
foo <- data.frame(engine = rep(seq(1, ngroups), each = nsamples), cycles = runif(ngroups*nsamples, 0, nsamples))

I want to find the maximum of cycles for each engine group and create a new variable tte = max(cycles) - cycles with mutate. I thought that if I precomputed the column of maximum cycles, instead of recomputing it inside the mutate call for each row, the code would be faster. It turns out I was wrong:

library(microbenchmark)
library(dplyr)
library(magrittr)

add_tte <- function(dataset){
  dataset %<>% group_by(engine) %>% mutate(max_cycles = max(cycles)) %>% 
    mutate(tte = max_cycles - cycles) %>% select(-max_cycles) %>% ungroup
}

add_tte_old <- function(dataset){
  dataset %<>% group_by(engine) %>% mutate(tte = max(cycles) - cycles) %>% ungroup
}

microbenchmark(add_tte(foo), add_tte_old(foo), times = 500)
# Unit: milliseconds
#             expr      min        lq     mean   median       uq       max neval
#     add_tte(foo) 17.45324 21.107264 26.50535 24.52625 28.75208 113.98433   500
# add_tte_old(foo)  8.10376  9.949188 13.35830 12.18336 14.52474  77.64578   500

Why is this happening? Is it because dplyr computes the maximum just once per group, rather than once per row?

EDIT: even if I use a single mutate statement in add_tte and create a bigger example, add_tte_old is still faster:

# these are the only lines of code modified, the rest is as before
nsamples <- 10000

foo <- data.frame(engine = rep(seq(1, ngroups), each = nsamples), cycles = runif(ngroups*nsamples, 0, nsamples))

add_tte <- function(dataset){
  dataset %<>% group_by(engine) %>% mutate(max_cycles = max(cycles), tte = max_cycles - cycles) %>%
  select(-max_cycles) %>% ungroup
}

# the new results are:
microbenchmark(add_tte(foo), add_tte_old(foo), times = 500)
# Unit: milliseconds
#             expr      min        lq      mean    median        uq      max neval
#     add_tte(foo) 90.46658 107.14015 139.13570 131.83689 158.24358 411.3272   500
# add_tte_old(foo) 39.38357  46.13531  62.57386  52.00782  69.26815 176.1512   500

You have made some wrong assumptions, but more importantly, you are not comparing like with like.

It would make more sense to look at the two variants below:

add_tte <- function(dataset) {
  dataset %<>% group_by(engine) %>% mutate(max_cycles = rep(max(cycles), times = n()), tte = max_cycles - cycles) %>%
    select(-max_cycles) %>% ungroup
}

add_tte_old <- function(dataset) {
  dataset %<>% group_by(engine) %>% mutate(extra = rep(1, times = n()), tte = max(cycles) - cycles) %>%
    select(-extra) %>% ungroup
}

microbenchmark(add_tte(foo), add_tte_old(foo), times = 100)

On my machine, these two are pretty similar.
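Before reading much into the timings, it may be worth confirming that both variants return the same tte values. The following is only a minimal sketch: the smaller data frame foo_small, the seed, and the names with_column, inline and reference are illustrative assumptions that do not appear in the original post, and the base R ave() call is used purely as an independent reference.

```r
library(dplyr)

# Smaller data than in the question so the check runs quickly (arbitrary sizes).
set.seed(1)
ngroups <- 10
nsamples <- 100
foo_small <- data.frame(
  engine = rep(seq(1, ngroups), each = nsamples),
  cycles = runif(ngroups * nsamples, 0, nsamples)
)

# Variant that first materialises the per-group maximum as a column.
with_column <- foo_small %>%
  group_by(engine) %>%
  mutate(max_cycles = rep(max(cycles), times = n()), tte = max_cycles - cycles) %>%
  select(-max_cycles) %>%
  ungroup()

# Variant that computes the maximum inline in the subtraction.
inline <- foo_small %>%
  group_by(engine) %>%
  mutate(tte = max(cycles) - cycles) %>%
  ungroup()

# Base R reference: ave() returns the group maximum aligned with each row.
reference <- ave(foo_small$cycles, foo_small$engine, FUN = max) - foo_small$cycles

# TRUE when all three computations agree.
same_result <- isTRUE(all.equal(with_column$tte, inline$tte)) &&
  isTRUE(all.equal(inline$tte, reference))
print(same_result)
```

If the check passes, any remaining timing difference comes from how the work is done rather than from the two functions computing different things.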

It is kind of ironic that, with your way of attempting to pre-compute max(cycles), you probably did what you were trying to avoid :)

In the case here, you should really use an explicit rep() to fill up the column, whereas in the subtraction max(cycles) - cycles the automatic recycling is fine.
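To make the recycling point concrete, here is a small sketch on a toy data frame (the name small and its values are made up for illustration). It only shows that the implicit and explicit forms produce the same column; it says nothing about which one is faster on a given dplyr version.

```r
library(dplyr)

# Toy data: two engines with three and two rows respectively.
small <- data.frame(
  engine = c(1, 1, 1, 2, 2),
  cycles = c(3, 7, 5, 10, 4)
)

# Implicit: max(cycles) has length 1 per group and is recycled to the group size.
implicit <- small %>%
  group_by(engine) %>%
  mutate(max_cycles = max(cycles), tte = max_cycles - cycles) %>%
  ungroup()

# Explicit: rep() fills the column to the group size given by n().
explicit <- small %>%
  group_by(engine) %>%
  mutate(max_cycles = rep(max(cycles), times = n()), tte = max_cycles - cycles) %>%
  ungroup()

implicit$max_cycles  # 7 7 7 10 10
implicit$tte         # 4 0 2 0 6
identical(implicit$max_cycles, explicit$max_cycles)  # TRUE
```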
