R model.matrix column names for factors

Question

I use model.matrix to create the design matrix for a GLM.

formula_test <- as.formula("Y ~ x1 + x2")
data_test <- expand.grid(
  Y = 1:100
  , x1 = c("A","B")
  , x2 = 1:20
)
result_test <- data.frame(model.matrix(
  object = formula_test
  , data = data_test
))
names(result_test)

Interestingly, the column names of result_test are "X.Intercept." "x1B" "x2"

Why is the second column named "x1B" rather than "x1A"?

I then tried data_test$x1 <- factor(x = data_test$x1, levels = c("A","B")) but the result is still the same.

Answer 1

That is because if you had c("X.Intercept.", "x1A", "x1B", "x2"), you would have perfect multicollinearity: x1A + x1B would be a column of ones, just like the X.Intercept. column. If, for the sake of interpretation, you prefer having x1A instead of the intercept, you can use

formula_test <- as.formula("Y ~ -1 + x1 + x2")

giving

names(result_test)
# [1] "x1A" "x1B" "x2"

and

all(rowSums(result_test[, c("x1A", "x1B")]) == 1)
# [1] TRUE
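The multicollinearity argument can also be checked numerically. The following is a minimal sketch, assuming the same data_test as in the question; the names no_intercept and overspecified are hypothetical and introduced only for illustration. It glues an intercept column onto the full dummy coding and compares the number of columns with the matrix rank.

```r
# Rebuild the example data from the question.
data_test <- expand.grid(Y = 1:100, x1 = c("A", "B"), x2 = 1:20)

# Full dummy coding without an intercept: columns x1A, x1B, x2.
no_intercept <- model.matrix(Y ~ -1 + x1 + x2, data = data_test)

# Hypothetical over-specified design: an intercept plus both dummies.
overspecified <- cbind(`(Intercept)` = 1, no_intercept)

ncol(overspecified)      # 4 columns
qr(overspecified)$rank   # rank 3, so one column is redundant
```

Because the rank is one less than the number of columns, a model fitted on such a matrix could not identify all four coefficients, which is why model.matrix drops one dummy when an intercept is present.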

As for why it is x1A that is dropped rather than x1B, the rule is that the first factor level goes away: under R's default treatment contrasts (contr.treatment), the first level serves as the reference. If instead we use

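# Reorder the levels so that "B" comes first; assigning to levels() directly
# would relabel the existing values instead of reordering them.
data_test$x1 <- factor(data_test$x1, levels = c("B", "A"))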

then this gives

names(result_test)
# [1] "X.Intercept." "x1A"          "x2"
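One way to choose the reference level while leaving the underlying values untouched is relevel(). The sketch below assumes the question's data_test, where expand.grid has already turned x1 into a factor; default_names and releveled_names are hypothetical names used only here. Note that model.matrix itself names the first column "(Intercept)"; the "X.Intercept." spelling appears only after wrapping the result in data.frame().

```r
# Rebuild the example data from the question; x1 is a factor with levels A, B.
data_test <- expand.grid(Y = 1:100, x1 = c("A", "B"), x2 = 1:20)

# Default: "A" is the first level, so it is the reference and x1B appears.
default_names <- colnames(model.matrix(Y ~ x1 + x2, data = data_test))
print(default_names)
# [1] "(Intercept)" "x1B"         "x2"

# relevel() moves "B" to the front without changing any observation's value.
data_test$x1 <- relevel(data_test$x1, ref = "B")
releveled_names <- colnames(model.matrix(Y ~ x1 + x2, data = data_test))
print(releveled_names)
# [1] "(Intercept)" "x1A"         "x2"
```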
