
Predicting the number of columns in R's model.matrix

Is there a way to predict, from the formula alone, how many columns a model.matrix will have, without having to instantiate the model.matrix?

I am trying to optimize the code for building a sparse.model.matrix.

The memory allocation used to construct a sparse model matrix is inefficient in this function. Because it does not know how many columns the final matrix will have, it cannot do a single memory allocation for one big matrix. Instead, it loops over the terms in the formula and allocates many smaller matrices. In each iteration of the loop it also cbinds the matrices together to grow the output matrix, which triggers many more memory allocations and is very slow for large data.

If there is a way to calculate how many columns the end result will need, we could preallocate the matrix and make sparse.model.matrix much more efficient.

The challenge for me is that I do not know how to compute how many columns are needed for interaction terms, especially for interactions of the form a:b:c. I also have no experience with contrasts, so I do not know how they affect the number of columns needed.

Here is a small example:

> set.seed(100)
> col_x1 = as.factor(sample(LETTERS[1:5], 10, replace = TRUE))
> col_x2 = as.factor(sample(LETTERS[1:10], 10, replace = TRUE))
> col_x3 = as.factor(sample(LETTERS[1:2], 10, replace = TRUE))
> df <- data.frame(X1 = col_x1, X2 = col_x2, X3 = col_x3)
> df
   X1 X2 X3
1   B  G  B
2   B  I  B
3   C  C  B
4   A  D  B
5   C  H  A
6   C  G  A
7   E  C  B
8   B  D  B
9   C  D  B
10  A  G  A
> str(df)
'data.frame':   10 obs. of  3 variables:
 $ X1: Factor w/ 4 levels "A","B","C","E": 2 2 3 1 3 3 4 2 3 1
 $ X2: Factor w/ 5 levels "C","D","G","H",..: 3 5 1 2 4 3 1 2 2 3
 $ X3: Factor w/ 2 levels "A","B": 2 2 2 2 1 1 2 2 2 1
> df_model_matrix <- model.matrix(~., df)
> dim(df_model_matrix)
[1] 10  9
> df_model_matrix <- model.matrix(~ X1 + X2 + X3 + X1*X2 + X2*X3 + X3*X1, df)
> dim(df_model_matrix)
[1] 10 28
> df_model_matrix <- model.matrix(~ X1 + X2 + X3 + X1*X2 + X2*X3 + X3*X1 + X1*X2*X3, df)
> dim(df_model_matrix)
[1] 10 40

In this case, the formula you are looking for is:

[Image: formula for the number of columns in model.matrix]
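
The formula image did not survive in this copy. As a hedged reconstruction (not necessarily what the original figure showed), the count below is consistent with all three dim() results in the question. It assumes that every variable is a factor, that the default treatment contrasts are used, that an intercept is present, and that every interaction is accompanied by its lower-order terms. The symbols are introduced here for illustration: T is the set of terms in the expanded formula and n_j is the number of levels of factor j.

```latex
p \;=\; 1 + \sum_{t \in T} \; \prod_{j \in t} \left( n_j - 1 \right)
```

With 4, 5 and 2 levels for X1, X2 and X3, the main effects give 1 + 3 + 4 + 1 = 9, the two-way interactions add 3*4 + 4*1 + 3*1 = 19 for a total of 28, and the three-way interaction adds 3*4*1 = 12 for a total of 40.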

If you are using model.matrix in a very specific way (supplying your own contrasts, suppressing the intercept, etc.), then you need to modify it accordingly.
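
A minimal sketch of computing the number of columns from terms() alone, without building the matrix. The names predict_ncol and width are hypothetical helpers, not part of base R or the Matrix package. The sketch assumes an intercept, that every variable in the formula is a plain factor or numeric column of the data frame (no log(x), poly(), etc.), and contrasts that yield n - 1 columns for a factor with n levels, as the default ones do. The data frame is typed in from the printed table in the question, so the example does not depend on the random number generator.

```r
# Hypothetical helper: count model.matrix columns from the terms object only.
predict_ncol <- function(formula, data) {
  tt <- terms(formula, data = data)
  stopifnot(attr(tt, "intercept") == 1)  # this sketch covers the intercept case only
  fac <- attr(tt, "factors")             # variables (rows) by terms (columns)
  if (length(fac) == 0) return(1)        # intercept-only formula

  # Columns contributed by one variable inside one term.
  # In the "factors" attribute, 1 means coded by contrasts and
  # 2 means coded by dummy variables for all levels (see ?terms.object).
  width <- function(v, code) {
    x <- data[[v]]
    if (!is.factor(x)) return(1)         # assumed numeric: a single column
    if (code == 2) nlevels(x) else nlevels(x) - 1
  }

  per_term <- vapply(seq_len(ncol(fac)), function(j) {
    idx <- which(fac[, j] > 0)           # variables appearing in term j
    prod(mapply(width, rownames(fac)[idx], fac[idx, j]))
  }, numeric(1))

  1 + sum(per_term)                      # 1 for the intercept
}

# Data from the question, entered explicitly.
df <- data.frame(
  X1 = factor(c("B", "B", "C", "A", "C", "C", "E", "B", "C", "A")),
  X2 = factor(c("G", "I", "C", "D", "H", "G", "C", "D", "D", "G")),
  X3 = factor(c("B", "B", "B", "B", "A", "A", "B", "B", "B", "A"))
)

predict_ncol(~ ., df)                                           # 9
predict_ncol(~ X1 + X2 + X3 + X1*X2 + X2*X3 + X3*X1, df)        # 28
predict_ncol(~ X1 + X2 + X3 + X1*X2 + X2*X3 + X3*X1 + X1*X2*X3, df)  # 40
```

The three results match the dim() output in the question. For other setups (custom contrasts, no intercept, transformed variables), it is worth checking the result against ncol(model.matrix(...)) on a small subset of the data before relying on it for preallocation.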
