Why is model.matrix so slow?

Question

Consider the following demo data:

set.seed(1)
n <- 1000000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)

If I build a matrix mt with model.matrix() , it takes forever:

system.time(mt <- model.matrix(~x1+x2+x3))
usuário   sistema decorrido 
 0.916     0.185     1.135

But if I do the same with matrix, it is pretty fast:

system.time(mt2 <- matrix(c(rep(1, n), x1, x2, x3), byrow=FALSE, ncol=4))
 usuário   sistema decorrido 
  0.085     0.021     0.105

Why the difference? Is what makes model.matrix() slow really necessary for lm() and related functions?

Answer 1

Using debugonce(model.matrix.default) , and within this using tracemem(data)

model.matrix.default calls model.frame which returns a data.frame . Within model.matrix.default , this data.frame is copied at least 3 times.

Why does lm use model.matrix --> lm is usually called with a data.frame , list or environment as the data argument. model.frame and returning a data.frame ensures that terms in the formula can be found by subsequent calls to lm , and will reference the same values.

Answer 2

In general, a call to Rprof (in library(utils) which should be on the search() path by default) will illustrate where the overhead in time is coming from in a function call:

Rprof("Rprof.out")
m1 <- model.matrix( ~ x1 + x2 + x3)
Rprof(NULL)
summaryRprof("Rprof.out")

giving

> summaryRprof("Rprof.out")
$by.self
                        self.time self.pct total.time total.pct
"model.matrix.default"       0.12    42.86       0.28    100.00
"na.omit.data.frame"         0.06    21.43       0.14     50.00
"[.data.frame"               0.04    14.29       0.08     28.57
"anyDuplicated.default"      0.04    14.29       0.04     14.29
"as.list.data.frame"         0.02     7.14       0.02      7.14

$by.total
                        total.time total.pct self.time self.pct
"model.matrix.default"        0.28    100.00      0.12    42.86
"model.matrix"                0.28    100.00      0.00     0.00
"na.omit.data.frame"          0.14     50.00      0.06    21.43
"model.frame"                 0.14     50.00      0.00     0.00
"model.frame.default"         0.14     50.00      0.00     0.00
"na.omit"                     0.14     50.00      0.00     0.00
"[.data.frame"                0.08     28.57      0.04    14.29
"["                           0.08     28.57      0.00     0.00
"anyDuplicated.default"       0.04     14.29      0.04    14.29
"anyDuplicated"               0.04     14.29      0.00     0.00
"as.list.data.frame"          0.02      7.14      0.02     7.14
"as.list"                     0.02      7.14      0.00     0.00
"vapply"                      0.02      7.14      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 0.28

Thus a large amount of time is spent on checking for NA s with na.omit.data.frame and subsetting the data.frame with [.data.frame , within model.frame.default . The proportions of time will vary depending on the sample size n , but will tend towards a limit for large sample sizes.

Why is model.matrix so slow?

Question

2 answers

solution1
3 ACCPTED 2014-02-12 00:19:00

solution2
1 2014-02-12 03:40:59

Why is model.matrix so slow?

Question

2 answers

solution1 3 ACCPTED 2014-02-12 00:19:00

solution2 1 2014-02-12 03:40:59

solution1
3 ACCPTED 2014-02-12 00:19:00

solution2
1 2014-02-12 03:40:59