Consider the following demo data:
set.seed(1)
n <- 1000000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
If I build a matrix mt
with model.matrix()
, it takes forever:
system.time(mt <- model.matrix(~x1+x2+x3))
usuário sistema decorrido
0.916 0.185 1.135
But if I do the same with matrix, it is pretty fast:
system.time(mt2 <- matrix(c(rep(1, n), x1, x2, x3), byrow=FALSE, ncol=4))
usuário sistema decorrido
0.085 0.021 0.105
Why the difference? Is what makes model.matrix()
slow really necessary for lm()
and related functions?
Using debugonce(model.matrix.default)
, and within this using tracemem(data)
model.matrix.default
calls model.frame
which returns a data.frame
. Within model.matrix.default
, this data.frame is copied at least 3 times.
Why does lm
use model.matrix
--> lm
is usually called with a data.frame
, list
or environment
as the data
argument. model.frame
and returning a data.frame
ensures that terms in the formula can be found by subsequent calls to lm
, and will reference the same values.
In general, a call to Rprof
(in library(utils)
which should be on the search()
path by default) will illustrate where the overhead in time is coming from in a function call:
Rprof("Rprof.out")
m1 <- model.matrix( ~ x1 + x2 + x3)
Rprof(NULL)
summaryRprof("Rprof.out")
giving
> summaryRprof("Rprof.out")
$by.self
self.time self.pct total.time total.pct
"model.matrix.default" 0.12 42.86 0.28 100.00
"na.omit.data.frame" 0.06 21.43 0.14 50.00
"[.data.frame" 0.04 14.29 0.08 28.57
"anyDuplicated.default" 0.04 14.29 0.04 14.29
"as.list.data.frame" 0.02 7.14 0.02 7.14
$by.total
total.time total.pct self.time self.pct
"model.matrix.default" 0.28 100.00 0.12 42.86
"model.matrix" 0.28 100.00 0.00 0.00
"na.omit.data.frame" 0.14 50.00 0.06 21.43
"model.frame" 0.14 50.00 0.00 0.00
"model.frame.default" 0.14 50.00 0.00 0.00
"na.omit" 0.14 50.00 0.00 0.00
"[.data.frame" 0.08 28.57 0.04 14.29
"[" 0.08 28.57 0.00 0.00
"anyDuplicated.default" 0.04 14.29 0.04 14.29
"anyDuplicated" 0.04 14.29 0.00 0.00
"as.list.data.frame" 0.02 7.14 0.02 7.14
"as.list" 0.02 7.14 0.00 0.00
"vapply" 0.02 7.14 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 0.28
Thus a large amount of time is spent on checking for NA
s with na.omit.data.frame
and subsetting the data.frame
with [.data.frame
, within model.frame.default
. The proportions of time will vary depending on the sample size n
, but will tend towards a limit for large sample sizes.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.