Like most people, I'm impressed by Hadley Wickham and what he's done for R, so I figured I'd move some functions toward his tidyverse... but having done so, I'm left wondering what the point of it all is.
My new dplyr functions are much slower than their base equivalents -- I hope I'm doing something wrong. I'd particularly like some payoff for the effort required to understand non-standard evaluation.
So, what am I doing wrong? Why is dplyr so slow?
An example:
require(microbenchmark)
require(dplyr)
df <- tibble(
  a = 1:10,
  b = c(1:5, 4:0),
  c = 10:1)

addSpread_base <- function() {
  df[['spread']] <- df[['a']] - df[['b']]
  df
}

addSpread_dplyr <- function() df %>% mutate(spread = a - b)
all.equal(addSpread_base(), addSpread_dplyr())
microbenchmark(addSpread_base(), addSpread_dplyr(), times = 1e4)
Timing results:
Unit: microseconds
expr min lq mean median uq max neval
addSpread_base() 12.058 15.769 22.07805 24.58 26.435 2003.481 10000
addSpread_dplyr() 607.537 624.697 666.08964 631.19 636.291 41143.691 10000
So using dplyr functions to transform the data takes about 30x longer -- surely this isn't the intention?
I figured that perhaps this was too easy a case, and that dplyr would really shine in a more realistic case where we add a column and subset the data -- but this was worse. As you can see from the timings below, it's ~70x slower than the base approach.
# mutate and subset
addSpreadSub_base <- function(df, col1, col2) {
  df[['spread']] <- df[[col1]] - df[[col2]]
  df[, c(col1, col2, 'spread')]
}

addSpreadSub_dplyr <- function(df, col1, col2) {
  var1 <- as.name(col1)
  var2 <- as.name(col2)
  qq <- quo(!!var1 - !!var2)
  df %>%
    mutate(spread = !!qq) %>%
    select(!!var1, !!var2, spread)
}
all.equal(addSpreadSub_base(df, col1 = 'a', col2 = 'b'),
          addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'))
microbenchmark(addSpreadSub_base(df, col1 = 'a', col2 = 'b'),
               addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'),
               times = 1e4)
Results:
Unit: microseconds
expr min lq mean median uq max neval
addSpreadSub_base(df, col1 = "a", col2 = "b") 22.725 30.610 44.3874 45.450 53.798 2024.35 10000
addSpreadSub_dplyr(df, col1 = "a", col2 = "b") 2748.757 2837.337 3011.1982 2859.598 2904.583 44207.81 10000
These are microseconds, and your dataset has 10 rows. Unless you plan on looping over millions of 10-row datasets, your benchmark is pretty much irrelevant (and in that case I can't imagine a situation where it wouldn't be wise to bind them together as a first step).
Let's do it with a bigger dataset, say 1 million times bigger:
df <- tibble(
  a = 1:10,
  b = c(1:5, 4:0),
  c = 10:1)

df2 <- bind_rows(replicate(1000000, df, simplify = FALSE))

addSpread_base <- function(df) {
  df[['spread']] <- df[['a']] - df[['b']]
  df
}

addSpread_dplyr <- function(df) df %>% mutate(spread = a - b)
microbenchmark::microbenchmark(
  addSpread_base(df2),
  addSpread_dplyr(df2),
  times = 100)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# addSpread_base(df2) 25.85584 26.93562 37.77010 32.33633 35.67604 170.6507 100 a
# addSpread_dplyr(df2) 26.91690 27.57090 38.98758 33.39769 39.79501 182.2847 100 a
Still quite fast and not much difference.
As for the "whys" of the result that you got, it's because you're using a much more complex function, so it has overheads.
Commenters have pointed out that dplyr doesn't try too hard to be fast, and maybe that's true when you compare it to data.table; the interface is the first concern, but the authors have been working hard on speed as well. Hybrid evaluation, for example, allows (if I've got it right) C code to run directly on grouped data when aggregating with common functions, which can be much faster than base code -- but simple code will always run faster with simple functions.
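For illustration, a minimal sketch (my own example, not from the question; the column names and sizes are made up) of the kind of grouped aggregation where that matters -- summarising a common function like mean() over many groups:
library(dplyr)

set.seed(1)
grouped_df <- tibble(
  g = sample(1:10000, 1e6, replace = TRUE),
  x = rnorm(1e6))

# dplyr can push a common aggregation like mean() down to compiled code
dplyr_way <- grouped_df %>%
  group_by(g) %>%
  summarise(avg = mean(x))

# a base comparison for the same computation
base_way <- tapply(grouped_df$x, grouped_df$g, mean)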