Dplyr code is slower than expected when creating new variables with mutate
I am using dplyr to create three new variables on my data frame. The data frame is 84,253 obs. of 164 variables. Below is my code.
ptm <- proc.time()
D04_Base2 <- D04_Base %>%
  mutate(
    birthyr = year(as.Date(BIRTHDT, "%m/%d/%Y")),
    age = (snapshotDt - as.Date(BIRTHDT, "%m/%d/%Y")) / 365.25,
    age = ifelse(age > 100, NA, age)
  )
proc.time() - ptm
#    user  system elapsed
#   12.34    0.03   12.42
However, I am wondering if there is a noticeable issue with my code that makes it take much longer than I expected to run, or if this is something else. As displayed above, it takes about 12 seconds for the code to complete.
Yes, there are some inefficiencies in your code:

- You convert the BIRTHDT column to Date twice. (This is by far the biggest issue.)
- base::as.Date isn't super fast.
- You could use dplyr::if_else instead of base::ifelse for a little bit of performance gain.

Let's do some tests:
library(microbenchmark)
library(dplyr)
library(lubridate)
mbm = microbenchmark::microbenchmark
# generate big-ish sample data
n = 1e5
dates = seq.Date(from = Sys.Date(), length.out = n, by = "day")
raw_dates = format(dates, "%m/%d/%Y")
df = data.frame(x = 1:n)
mbm(
mdy = mdy(raw_dates),
base = as.Date(raw_dates, format = "%m/%d/%Y")
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# mdy 21.39190 27.97036 37.35768 29.50610 31.44242 197.2258 100 a
# base 86.75255 92.30122 99.34004 96.78687 99.90462 262.6260 100 b
Looks like lubridate::mdy is 2-3x faster than as.Date at this particular date conversion.
mbm(
year = year(dates),
format = format(dates, "%Y")
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# year 29.10152 31.71873 44.84572 33.48525 40.17116 478.8377 100 a
# format 77.16788 81.14211 96.42225 83.54550 88.11994 242.7808 100 b
Similarly, lubridate::year (which you already seem to be using) is about 2x faster than base::format for extracting the year.
mbm(
base_dollar = {dd = df; dd$y = 1},
base_bracket = {dd = df; dd[["y"]] = 1},
mutate = {dd = mutate(df, y = 1)},
mutate_pipe = {dd = df %>% mutate(y = 1)},
times = 100L
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# base_dollar 114.834 129.1715 372.8024 146.2275 408.4255 3315.964 100 a
# base_bracket 118.585 139.6550 332.1661 156.3530 255.2860 3126.967 100 a
# mutate 420.515 466.8320 673.9109 554.4960 745.7175 2821.070 100 b
# mutate_pipe 522.402 600.6325 852.2037 715.1110 906.4700 3319.950 100 c
Here we see base do very well. But also notice that these times are in microseconds, whereas the times above for the date operations were in milliseconds. Whether you use base or dplyr to add a column, it's about 1% of the time used to do the date conversions.
x = rnorm(1e5)
mbm(
base_na = ifelse(x > 0, NA, x),
base_na_real = ifelse(x > 0, NA_real_, x),
base_replace = replace(x, x > 0, NA_real_),
dplyr = if_else(x > 0, NA_real_, x),
units = "ms"
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# base_na 9.399593 13.399255 18.502441 14.734466 15.998573 138.33834 100 bc
# base_na_real 8.785988 12.638971 22.885304 14.075802 16.980263 132.18165 100 c
# base_replace 0.748265 1.136756 2.292686 1.384161 1.802833 9.05869 100 a
# dplyr 5.141753 6.875031 14.157227 10.095069 11.561044 124.99218 100 b
Here the timing is still in milliseconds, but the difference between ifelse and dplyr::if_else isn't so extreme. dplyr::if_else requires that the return vectors are the same type, so we have to specify NA_real_ for it to work with the numeric output. At Frank's suggestion I threw in base::replace with NA_real_ too, and it is about 10x faster. The lesson here, I think, is "use the simplest function that works".
In summary, dplyr is slower than base at adding a column, but both are super fast compared to everything else that's going on, so it doesn't much matter which column-adding method you use. You can speed up your code by not repeating calculations and by using faster versions of the bigger operations. Using what we learned, a more efficient version of your code would be:
library(dplyr)
library(lubridate)
D04_Base2 <- D04_Base %>%
  mutate(
    birthdate = mdy(BIRTHDT),
    birthyr = year(birthdate),
    age = (snapshotDt - birthdate) / 365.25,
    age = replace(age, age > 100, NA_real_)
  )
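As a quick sanity check (on invented sample rows, since D04_Base itself isn't shown), the rewritten pipeline produces the same values as the original:

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for D04_Base; BIRTHDT stored as "%m/%d/%Y" strings
D04_Base <- data.frame(BIRTHDT = c("06/15/1980", "01/02/1890"),
                       stringsAsFactors = FALSE)
snapshotDt <- as.Date("2018-01-01")

# Original approach: parses BIRTHDT twice
old <- D04_Base %>%
  mutate(
    birthyr = year(as.Date(BIRTHDT, "%m/%d/%Y")),
    age = as.numeric(snapshotDt - as.Date(BIRTHDT, "%m/%d/%Y")) / 365.25,
    age = ifelse(age > 100, NA, age)
  )

# Rewritten approach: parses once and reuses birthdate
new <- D04_Base %>%
  mutate(
    birthdate = mdy(BIRTHDT),
    birthyr = year(birthdate),
    age = as.numeric(snapshotDt - birthdate) / 365.25,
    age = replace(age, age > 100, NA_real_)
  )

all.equal(old$age, new$age)  # TRUE (the 1890 row becomes NA in both)
```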
We can ballpark the speed gain on 1e5 rows at about 180 milliseconds, broken out below:

- 170 ms (a single lubridate::mdy at 30 ms instead of two as.Date calls at 100 ms each)
- 10 ms (replace rather than ifelse)

The adding-a-column benchmark suggests that we could save about 0.1 ms by not using the pipe. Since we are adding multiple columns, it's probably more efficient to use dplyr than to add them individually with $<-, but for a single column we could save about 0.5 ms by not using dplyr. Since we've already sped up by 180-ish ms, the potential fraction of a millisecond gained by not using mutate is a rounding error, not an efficiency boost.
In this case, the most complicated thing you're doing is the Date conversion, but even this is likely not your bottleneck if you're doing more processing. To optimize your code, you should see which pieces are slow and work on the slow bits. This is called profiling. In this answer I used microbenchmark to compare competing short methods head-to-head, but other tools (like the lineprof package) are better for identifying the slowest parts of a block of code.
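As a sketch of that profiling workflow using only base R (the file name and sampling interval here are arbitrary choices):

```r
# Sample the call stack while a slow block runs, then summarize
prof_file <- tempfile()
Rprof(prof_file, interval = 0.01)

raw <- format(seq.Date(Sys.Date(), length.out = 1e5, by = "day"), "%m/%d/%Y")
parsed <- as.Date(raw, format = "%m/%d/%Y")   # expect this step to dominate
yrs <- as.integer(format(parsed, "%Y"))

Rprof(NULL)                            # stop profiling
head(summaryRprof(prof_file)$by.self)  # time per function, slowest first
```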