Why is dplyr code that precomputes a column maximum slower than dplyr code that computes it inside the mutate call?

Question

Sample data frame:

ngroups <- 100
nsamples <- 1000
foo <- data.frame(engine = rep(seq(1, ngroups), each = nsamples), cycles = runif(ngroups*nsamples, 0, nsamples))

I want to find the maximum of cycles for each engine group and create a new variable tte = max(cycles) - cycles with mutate. I thought that precomputing the column of maximum cycles, rather than recomputing it inside the mutate call for each row, would make the code faster. It turns out I was wrong:

library(microbenchmark)
library(dplyr)
library(magrittr)

add_tte <- function(dataset){
  dataset %<>% group_by(engine) %>% mutate(max_cycles = max(cycles)) %>% 
    mutate(tte = max_cycles - cycles) %>% select(-max_cycles) %>% ungroup
}

add_tte_old <- function(dataset){
  dataset %<>% group_by(engine) %>% mutate(tte = max(cycles) - cycles) %>% ungroup
}

microbenchmark(add_tte(foo), add_tte_old(foo), times = 500)
# Unit: milliseconds
#             expr      min        lq     mean   median       uq       max neval
#     add_tte(foo) 17.45324 21.107264 26.50535 24.52625 28.75208 113.98433   500
# add_tte_old(foo)  8.10376  9.949188 13.35830 12.18336 14.52474  77.64578   500

Why is this happening? Is it because dplyr computes the maximum only once per group, rather than once per row?
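One way to probe the "once per group" guess is to count how often a function is called inside a grouped mutate. This is a minimal sketch, assuming a small toy data frame and a hypothetical wrapper counting_max() (neither is from the original post). A wrapper may bypass any internal shortcut dplyr applies to a bare max(), so it only shows how an ordinary R expression is evaluated:

```r
library(dplyr)

# Toy data (assumed for illustration): 3 engines with 4 rows each.
toy <- data.frame(engine = rep(1:3, each = 4),
                  cycles = seq(10, 120, by = 10))

# Hypothetical helper: behaves like max() but records each call.
counter <- new.env()
counter$calls <- 0L
counting_max <- function(x) {
  counter$calls <- counter$calls + 1L
  max(x)
}

result <- toy %>%
  group_by(engine) %>%
  mutate(tte = counting_max(cycles) - cycles) %>%
  ungroup()

# Compare the number of calls with the number of rows.
print(counter$calls)
print(nrow(toy))
```

If the call count tracks the number of groups rather than the number of rows, the maximum is not being recomputed for every row.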

EDIT: even if I use a single mutate statement in add_tte and create a bigger example, add_tte_old is still faster:

# these are the only lines of code modified, the rest is as before
nsamples <- 10000

foo <- data.frame(engine = rep(seq(1, ngroups), each = nsamples), cycles = runif(ngroups*nsamples, 0, nsamples))

add_tte <- function(dataset){
  dataset %<>% group_by(engine) %>% mutate(max_cycles = max(cycles), tte = max_cycles - cycles) %>%
  select(-max_cycles) %>% ungroup
}

# the new results are:
microbenchmark(add_tte(foo), add_tte_old(foo), times = 500)
# Unit: milliseconds
#             expr      min        lq      mean    median        uq      max neval
#     add_tte(foo) 90.46658 107.14015 139.13570 131.83689 158.24358 411.3272   500
# add_tte_old(foo) 39.38357  46.13531  62.57386  52.00782  69.26815 176.1512   500

Answer 1

You have made some wrong assumptions, but more importantly, you are not comparing like with like.

It would make more sense to look at the two variants below:

add_tte <- function(dataset) {
  dataset %<>% group_by(engine) %>% mutate(max_cycles = rep(max(cycles), times = n()), tte = max_cycles - cycles) %>%
    select(-max_cycles) %>% ungroup
}

add_tte_old <- function(dataset) {
  dataset %<>% group_by(engine) %>% mutate(extra = rep(1, times = n()), tte = max(cycles) - cycles) %>%
    select(-extra) %>% ungroup
}

microbenchmark(add_tte(foo), add_tte_old(foo), times = 100)

On my machine, these two are pretty similar.

It is kind of ironic that by trying to pre-compute max(cycles) this way, you probably did exactly what you were trying to avoid :)

In this case, you should use an explicit rep() to fill the column, whereas in the subtraction max(cycles) - cycles the automatic recycling is fine.
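Whichever form is faster on a given machine, the variants should at least agree on their output. This is a minimal sketch that checks this on a small toy data frame (the data and the variable names recycled, explicit and direct are assumptions for illustration, not part of the original answer):

```r
library(dplyr)

# Toy data (assumed for illustration): 3 engines with 4 rows each.
toy <- data.frame(engine = rep(1:3, each = 4),
                  cycles = c(5, 9, 2, 7, 1, 8, 3, 6, 4, 10, 12, 11))

# Length-1 result per group, recycled by mutate to fill the column.
recycled <- toy %>%
  group_by(engine) %>%
  mutate(max_cycles = max(cycles)) %>%
  ungroup()

# Same column, filled explicitly with rep().
explicit <- toy %>%
  group_by(engine) %>%
  mutate(max_cycles = rep(max(cycles), times = n())) %>%
  ungroup()

# Maximum computed inline inside the subtraction.
direct <- toy %>%
  group_by(engine) %>%
  mutate(tte = max(cycles) - cycles) %>%
  ungroup()

same_column <- identical(recycled$max_cycles, explicit$max_cycles)
same_tte <- identical(explicit$max_cycles - explicit$cycles, direct$tte)
print(c(same_column = same_column, same_tte = same_tte))
```

Running a check like this before benchmarking makes sure any timing difference comes from how the column is built, not from the functions computing different things.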

Accepted answer, 2018-02-15 15:39:18