Dplyr code is slower than expected when creating new variables with mutate
I am using dplyr to create three new variables on my data frame. The data frame is 84,253 obs. of 164 variables. Below is my code.
ptm <- proc.time()
D04_Base2 <- D04_Base %>%
  mutate(
    birthyr = year(as.Date(BIRTHDT, "%m/%d/%Y")),
    age = (snapshotDt - as.Date(BIRTHDT, "%m/%d/%Y")) / 365.25,
    age = ifelse(age > 100, NA, age)
  )
proc.time() - ptm
#    user  system elapsed
#   12.34    0.03   12.42
However, I am wondering if there is a noticeable issue with my code that makes it take much longer than I expected to run, or if this is something else. As displayed above, it takes about 12 seconds for the code to complete.
Yes, there are some inefficiencies in your code:

- You convert the BIRTHDT column to Date twice. (This is by far the biggest issue.)
- base::as.Date isn't super fast.
- You could use dplyr::if_else instead of base::ifelse for a little bit of performance gain.

Let's do some tests:
library(microbenchmark)
library(dplyr)
library(lubridate)
mbm = microbenchmark::microbenchmark
# generate big-ish sample data
n = 1e5
dates = seq.Date(from = Sys.Date(), length.out = n, by = "day")
raw_dates = format(dates, "%m/%d/%Y")
df = data.frame(x = 1:n)
mbm(
mdy = mdy(raw_dates),
base = as.Date(raw_dates, format = "%m/%d/%Y")
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# mdy 21.39190 27.97036 37.35768 29.50610 31.44242 197.2258 100 a
# base 86.75255 92.30122 99.34004 96.78687 99.90462 262.6260 100 b
Looks like lubridate::mdy is 2-3x faster than as.Date at this particular date conversion.
mbm(
year = year(dates),
format = format(dates, "%Y")
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# year 29.10152 31.71873 44.84572 33.48525 40.17116 478.8377 100 a
# format 77.16788 81.14211 96.42225 83.54550 88.11994 242.7808 100 b
Similarly, lubridate::year (which you already seem to be using) is about 2x faster than base::format for extracting the year.
mbm(
base_dollar = {dd = df; dd$y = 1},
base_bracket = {dd = df; dd[["y"]] = 1},
mutate = {dd = mutate(df, y = 1)},
mutate_pipe = {dd = df %>% mutate(y = 1)},
times = 100L
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# base_dollar 114.834 129.1715 372.8024 146.2275 408.4255 3315.964 100 a
# base_bracket 118.585 139.6550 332.1661 156.3530 255.2860 3126.967 100 a
# mutate 420.515 466.8320 673.9109 554.4960 745.7175 2821.070 100 b
# mutate_pipe 522.402 600.6325 852.2037 715.1110 906.4700 3319.950 100 c
Here we see base do very well. But also notice that these times are in microseconds, whereas the times above for the date operations were in milliseconds. Whether you use base or dplyr to add a column, it's about 1% of the time used to do the date conversions.
x = rnorm(1e5)
mbm(
base_na = ifelse(x > 0, NA, x),
base_na_real = ifelse(x > 0, NA_real_, x),
base_replace = replace(x, x > 0, NA_real_),
dplyr = if_else(x > 0, NA_real_, x),
units = "ms"
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# base_na 9.399593 13.399255 18.502441 14.734466 15.998573 138.33834 100 bc
# base_na_real 8.785988 12.638971 22.885304 14.075802 16.980263 132.18165 100 c
# base_replace 0.748265 1.136756 2.292686 1.384161 1.802833 9.05869 100 a
# dplyr 5.141753 6.875031 14.157227 10.095069 11.561044 124.99218 100 b
Here the timing is still in milliseconds, but the difference between ifelse and dplyr::if_else isn't so extreme. dplyr::if_else requires that the return vectors are the same type, so we have to specify NA_real_ for it to work with the numeric output. At Frank's suggestion I threw in base::replace with NA_real_ too, and it is about 10x faster. The lesson here, I think, is "use the simplest function that works".
In summary, dplyr is slower than base at adding a column, but both are super fast compared to everything else that's going on, so it doesn't much matter which column-adding method you use. You can speed up your code by not repeating calculations and by using faster versions of the bigger operations. Using what we learned, a more efficient version of your code would be:
library(dplyr)
library(lubridate)
D04_Base2 <- D04_Base %>%
  mutate(
    birthdate = mdy(BIRTHDT),
    birthyr = year(birthdate),
    age = (snapshotDt - birthdate) / 365.25,
    age = replace(age, age > 100, NA_real_)
  )
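As a quick sanity check (on invented sample rows, since D04_Base itself isn't shown), the rewritten pipeline produces the same values as the original:

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for D04_Base; BIRTHDT stored as "%m/%d/%Y" strings
D04_Base <- data.frame(BIRTHDT = c("06/15/1980", "01/02/1890"),
                       stringsAsFactors = FALSE)
snapshotDt <- as.Date("2018-01-01")

# Original approach: parses BIRTHDT twice
old <- D04_Base %>%
  mutate(
    birthyr = year(as.Date(BIRTHDT, "%m/%d/%Y")),
    age = as.numeric(snapshotDt - as.Date(BIRTHDT, "%m/%d/%Y")) / 365.25,
    age = ifelse(age > 100, NA, age)
  )

# Rewritten approach: parses once and reuses birthdate
new <- D04_Base %>%
  mutate(
    birthdate = mdy(BIRTHDT),
    birthyr = year(birthdate),
    age = as.numeric(snapshotDt - birthdate) / 365.25,
    age = replace(age, age > 100, NA_real_)
  )

all.equal(old$age, new$age)  # TRUE (the 1890 row becomes NA in both)
```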
We can ballpark the speed gain on 1e5 rows at about 180 milliseconds, broken out below:

- 170 ms (a single lubridate::mdy at 30 ms instead of two as.Date calls at 100 ms each)
- 10 ms (replace rather than ifelse)

The adding-a-column benchmark suggests that we could save about 0.1 ms by not using the pipe. Since we are adding multiple columns, it's probably more efficient to use dplyr than to add them individually with $<-, but for a single column we could save about 0.5 ms by not using dplyr. Since we've already sped up by 180-ish ms, the potential fraction of a millisecond gained by not using mutate is a rounding error, not an efficiency boost.
In this case, the most complicated thing you're doing is the Date conversion, but even this is likely not your bottleneck if you're doing more processing. To optimize your code, you should see which pieces are slow and work on the slow bits. This is called profiling. In this answer I used microbenchmark to compare competing short methods head-to-head, but other tools (like the lineprof package) are better for identifying the slowest parts of a block of code.
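As a sketch of that profiling workflow using only base R (the file name and sampling interval here are arbitrary choices):

```r
# Sample the call stack while a slow block runs, then summarize
prof_file <- tempfile()
Rprof(prof_file, interval = 0.01)

raw <- format(seq.Date(Sys.Date(), length.out = 1e5, by = "day"), "%m/%d/%Y")
parsed <- as.Date(raw, format = "%m/%d/%Y")   # expect this step to dominate
yrs <- as.integer(format(parsed, "%Y"))

Rprof(NULL)                            # stop profiling
head(summaryRprof(prof_file)$by.self)  # time per function, slowest first
```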