R model.matrix column names for factors

I use model.matrix to create a matrix used by GLM.

formula_test <- as.formula("Y ~ x1 + x2")
data_test <- expand.grid(
  Y = 1:100
  , x1 = c("A","B")
  , x2 = 1:20
)
result_test <- data.frame(model.matrix(
  object = formula_test
  , data = data_test
))
names(result_test)

Interestingly, the column names of result_test are "X.Intercept." "x1B" "x2".

Why is the second column not named "x1A"?

I then tried data_test$x1 <- factor(x = data_test$x1, levels = c("A","B")) but it's still the same.

That is because if the columns were c("X.Intercept.", "x1A", "x1B", "x2"), you would have perfect multicollinearity: x1A + x1B would be a column of ones, identical to the X.Intercept. column. If, for the sake of interpretation, you prefer to keep x1A and drop the intercept instead, you can use

formula_test <- as.formula("Y ~ -1 + x1 + x2")

giving

names(result_test)
# [1] "x1A" "x1B" "x2" 

and

all(rowSums(result_test[, c("x1A", "x1B")]) == 1)
# [1] TRUE
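
The collinearity argument can also be checked numerically. The following is a minimal sketch, assuming the same data_test as in the question; the names mm_no_int, mm_full, n_cols and full_rank are introduced here for illustration only. It adds an explicit intercept column next to x1A and x1B and compares the matrix rank with the number of columns.

```r
# Rebuild the example data from the question
data_test <- expand.grid(Y = 1:100, x1 = c("A", "B"), x2 = 1:20,
                         stringsAsFactors = TRUE)

# Design matrix without an intercept: keeps both x1A and x1B
mm_no_int <- model.matrix(Y ~ -1 + x1 + x2, data = data_test)

# Add an intercept column by hand to get the "overcomplete" design
mm_full <- cbind(Intercept = 1, mm_no_int)

# Number of columns versus numerical rank of the matrix
n_cols <- ncol(mm_full)
full_rank <- qr(mm_full)$rank

n_cols     # 4 columns: Intercept, x1A, x1B, x2
full_rank  # 3, because Intercept = x1A + x1B
```

Since the rank is one less than the number of columns, a model fitted on all four columns could not identify all four coefficients, which is why one of them is left out.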

As for why it is x1A that is dropped rather than x1B: R's default treatment contrasts (contr.treatment) use the first factor level as the reference level, so that level gets no column of its own. If instead we use

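# Reorder the levels so that "B" comes first; the data values stay unchanged
# (levels(x) <- c("B", "A") would relabel the observations instead)
data_test$x1 <- factor(data_test$x1, levels = c("B", "A"))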

then this gives

names(result_test)
# [1] "X.Intercept." "x1A"          "x2"  
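
For choosing the reference level, base R also provides relevel(). Here is a minimal sketch, again assuming the data_test from the question; data_relevel, names_relevel and same_values are illustrative names. Note that model.matrix() itself names the first column "(Intercept)"; it only becomes "X.Intercept." once the matrix is wrapped in data.frame().

```r
# Rebuild the example data from the question
data_test <- expand.grid(Y = 1:100, x1 = c("A", "B"), x2 = 1:20,
                         stringsAsFactors = TRUE)

# relevel() moves "B" to the front of the levels; observations are untouched
data_relevel <- data_test
data_relevel$x1 <- relevel(data_relevel$x1, ref = "B")

# Column names straight from model.matrix(), before data.frame() renames them
names_relevel <- colnames(model.matrix(Y ~ x1 + x2, data = data_relevel))
names_relevel
# [1] "(Intercept)" "x1A"         "x2"

# Check that only the level order changed, not the values
same_values <- all(as.character(data_relevel$x1) == as.character(data_test$x1))
```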
