
Dplyr code is slower than expected when creating new variables with mutate

I am using dplyr to create three new variables in my data frame, which has 84,253 observations of 164 variables. Below is my code.

# ptm <- proc.time()
 D04_Base2 <- D04_Base %>% 
    mutate(
        birthyr = year(as.Date(BIRTHDT,"%m/%d/%Y")),
        age = (snapshotDt - as.Date(BIRTHDT,"%m/%d/%Y")) / 365.25,
        age = ifelse(age > 100, NA, age)
        )
# proc.time() - ptm
user  system elapsed 
12.34    0.03   12.42 

However, the code takes much longer to run than I expected: about 12 seconds, as shown above. Is there a noticeable issue with my code, or is something else going on?

Yes, there are some inefficiencies in your code:

  1. You convert the BIRTHDT column to Date twice. (This is by far the biggest issue.)
  2. base::as.Date isn't especially fast.
  3. You can use dplyr::if_else instead of base::ifelse for a little bit of performance gain.

Let's do some tests:

library(microbenchmark)
library(dplyr)
library(lubridate)

mbm = microbenchmark::microbenchmark

# generate big-ish sample data
n = 1e5
dates = seq.Date(from = Sys.Date(), length.out = n, by = "day")
raw_dates = format(dates, "%m/%d/%Y")
df = data.frame(x = 1:n)

Date conversion

mbm(
    mdy = mdy(raw_dates),
    base = as.Date(raw_dates, format = "%m/%d/%Y")
)
# Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval cld
#   mdy 21.39190 27.97036 37.35768 29.50610 31.44242 197.2258   100  a 
#  base 86.75255 92.30122 99.34004 96.78687 99.90462 262.6260   100   b

Looks like lubridate::mdy is 2-3x faster than as.Date at this particular date conversion.
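As a quick sanity check (a minimal sketch using the sample data generated above), the faster conversion should give exactly the same dates, so the speed-up doesn't change the result:

# both conversions should produce the same Date vector
all.equal(mdy(raw_dates), as.Date(raw_dates, format = "%m/%d/%Y"))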

Extracting year

mbm(
    year = year(dates),
    format = format(dates, "%Y")
)
# Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval cld
#    year 29.10152 31.71873 44.84572 33.48525 40.17116 478.8377   100  a 
#  format 77.16788 81.14211 96.42225 83.54550 88.11994 242.7808   100   b

Similarly, lubridate::year (which you already seem to be using) is about 2x faster than base::format for extracting the year.
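One detail to be aware of: format() returns a character vector while year() returns a number, so getting a numeric year via base costs an extra coercion step (a small illustrative check on the sample dates, not part of the benchmark above):

# format() returns character years; year() returns numeric years,
# so a numeric year via base needs an extra as.integer() step
yr_base = as.integer(format(dates, "%Y"))
yr_lub  = year(dates)
all.equal(yr_base, yr_lub)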

Adding a column

mbm(
    base_dollar = {dd = df; dd$y = 1},
    base_bracket = {dd = df; dd[["y"]] = 1},
    mutate = {dd = mutate(df, y = 1)},
    mutate_pipe = {dd = df %>% mutate(y = 1)},
    times = 100L
)
# Unit: microseconds
#          expr     min       lq     mean   median       uq      max neval cld
#   base_dollar 114.834 129.1715 372.8024 146.2275 408.4255 3315.964   100 a  
#  base_bracket 118.585 139.6550 332.1661 156.3530 255.2860 3126.967   100 a  
#        mutate 420.515 466.8320 673.9109 554.4960 745.7175 2821.070   100  b 
#   mutate_pipe 522.402 600.6325 852.2037 715.1110 906.4700 3319.950   100   c

Here we see base do very well. But also notice that these times are in microseconds, whereas the times above for the date operations were in milliseconds. Whether you use base or dplyr to add a column, it takes about 1% of the time used for the date conversions.

ifelse

x = rnorm(1e5)
mbm(
    base_na = ifelse(x > 0, NA, x),
    base_na_real = ifelse(x > 0, NA_real_, x),
    base_replace = replace(x, x > 0, NA_real_),
    dplyr = if_else(x > 0, NA_real_, x),
    units = "ms"
)
# Unit: milliseconds
#          expr      min        lq      mean    median        uq       max neval cld
#       base_na 9.399593 13.399255 18.502441 14.734466 15.998573 138.33834   100  bc
#  base_na_real 8.785988 12.638971 22.885304 14.075802 16.980263 132.18165   100   c
#  base_replace 0.748265  1.136756  2.292686  1.384161  1.802833   9.05869   100 a  
#         dplyr 5.141753  6.875031 14.157227 10.095069 11.561044 124.99218   100  b 

Here the timing is still in milliseconds, but the difference between ifelse and dplyr::if_else isn't so extreme. dplyr::if_else requires that the return vectors are the same type, so we have to specify the NA_real_ for it to work with the numeric output. At Frank's suggestion I threw in base::replace with NA_real_ too, and it is about 10x faster. The lesson here, I think, is "use the simplest function that works".
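To be sure the faster alternatives aren't changing the answer, a quick check on the same x (just a sketch) confirms that all three approaches produce the same vector:

# all three should agree: NA where x > 0, x otherwise
res_ifelse  = ifelse(x > 0, NA_real_, x)
res_replace = replace(x, x > 0, NA_real_)
res_if_else = if_else(x > 0, NA_real_, x)
identical(res_ifelse, res_replace) && identical(res_replace, res_if_else)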


In summary, dplyr is slower than base at adding a column, but both are super fast compared to everything else that's going on. So it doesn't much matter which column-adding method you use. You can speed up your code by not repeating calculations and by using faster versions of bigger operations. Using what we learned, a more efficient version of your code would be:

library(dplyr)
library(lubridate)
D04_Base2 <- D04_Base %>% 
    mutate(
        birthdate = mdy(BIRTHDT),
        birthyr = year(birthdate),
        age = (snapshotDt - birthdate) / 365.25,
        age = replace(age, age > 100, NA_real_)
    )
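One small caveat (assuming snapshotDt is a Date column): subtracting two Dates produces a difftime, so if you want age stored as a plain numeric you can wrap the difference in as.numeric, i.e. replace the age line with:

        age = as.numeric(snapshotDt - birthdate) / 365.25,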

We can ballpark the speed gain on 1e5 rows at about 180 milliseconds as broken out below.

  • 170 ms (single lubridate::mdy at 30 ms instead of two as.Date calls at 100 ms each)
  • 10 ms (replace rather than ifelse)

The adding-a-column benchmark suggests that we could save about 0.1 ms by not using the pipe. Since we are adding multiple columns, it's probably more efficient to use dplyr than to add them individually with $<-, but for a single column we could save about 0.5 ms by not using dplyr. Since we've already sped up by 180-ish ms, the potential fraction of a millisecond gained by not using mutate is a rounding error, not an efficiency boost.

In this case, the most complicated thing you're doing is the date conversion, but even this is likely not your bottleneck if you're doing more processing. To optimize your code, you should find out which pieces are slow and work on the slow bits. This is called profiling. In this answer I used microbenchmark to compare competing short methods head-to-head, but other tools (like the lineprof package) are better for identifying the slowest parts of a block of code.
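For example, a minimal profiling sketch with base R's sampling profiler (utils::Rprof; the output file name here is just illustrative) would look like this:

# profile the whole pipeline to see where the time actually goes
Rprof("profile.out")
D04_Base2 <- D04_Base %>% 
    mutate(
        birthdate = mdy(BIRTHDT),
        birthyr = year(birthdate),
        age = (snapshotDt - birthdate) / 365.25,
        age = replace(age, age > 100, NA_real_)
    )
Rprof(NULL)
summaryRprof("profile.out")$by.self  # time spent in each function, slowest first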
