I'm messing around with R and came across some code using the tilde operator and model.matrix in a way I can't quite figure out. I produced a very simple example below.
rm(list=ls())
data("iris")
x = model.matrix(~Species +0, iris)
x
Here's a quick snapshot of the data, selecting three rows, one of each species:
> iris[c(1, 78, 143),]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
78 6.7 3.0 5.0 1.7 versicolor
143 5.8 2.7 5.1 1.9 virginica
and of the same rows after they've been put through model.matrix:
> x[c(1,78,143),]
Speciessetosa Speciesversicolor Speciesvirginica
1 1 0 0
78 0 1 0
143 0 0 1
As you can see, this creates a matrix of 3 columns and 150 rows (the same number of observations). In each row the column corresponding to that flower's species is set to 1. This is really neat, but I can't understand how or why this works exactly.
I'm confused about two things. First, the help page for model.matrix says:

model.matrix creates a design (or model) matrix, e.g., by expanding factors to a set of dummy variables (depending on the contrasts) and expanding interactions similarly.

I don't quite follow what this means. What does it mean to "expand factors" to a set of dummy variables?
Second, I've only ever seen the tilde used inside model-fitting calls like

lm(Y ~ X1+X2-X3)

or the like; I've never seen it used on its own. If I just run something similar in the console without the wrapping of model.matrix, like so:
> ~iris$Species
I only get this as the output:
~iris$Species
which honestly looks like an error, but since R doesn't complain, I assume it isn't one.
There's a lot here; let's go point by point. The main idea is that model.matrix() is designed to take the variables in your data set and transform them into a matrix format that is suitable for linear regression (you may be familiar with the linear-algebra expression for regression, y = X %*% beta).

For simple numerical covariates, the translation is simple: the variable becomes a column of X. Categorical variables (factors in R), however, have to be transformed into a set of binary variables that represent differences in expected response across categories, in a way that is defined by the contrasts associated with the factor.
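To see what "expanding a factor" looks like in practice, here is a minimal sketch comparing the default coding (treatment contrasts, with an intercept) to the +0 indicator coding used in the question:

```r
# Compare the two codings of the Species factor in the iris data.
data(iris)

# Default: an intercept column plus dummies for all levels except the
# baseline ("setosa"), which is absorbed into the intercept.
with_intercept <- model.matrix(~ Species, iris)
colnames(with_intercept)
# "(Intercept)" "Speciesversicolor" "Speciesvirginica"

# With +0: no intercept, so every level gets its own 0/1 indicator column.
no_intercept <- model.matrix(~ Species + 0, iris)
colnames(no_intercept)
# "Speciessetosa" "Speciesversicolor" "Speciesvirginica"
```

The two matrices encode the same information; they just parameterize it differently (baseline-plus-differences versus one mean per group).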
~Species + 0 says to set up dummy variables for the Species factor; +0 says not to use an intercept. In this special case, the dummy variables created are indicator variables: for a given row, the value in the column corresponding to that observation's species is 1, and the others are 0. Since (for example) the first observation is a setosa, Speciessetosa is 1 and the other columns are zero. If you had a vector of coefficients (beta) that contained the mean for each species, the product X %*% beta would pick out the mean corresponding to the species of each observation.
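The "picking out the mean" step can be sketched directly (a small illustration, not part of the original question; the names X, beta, and fitted are just illustrative):

```r
# With the intercept-free coding, X %*% beta with per-species means as
# beta reproduces each observation's group mean.
data(iris)
X <- model.matrix(~ Species + 0, iris)

# Group means of Sepal.Length, in the same order as the columns of X
# (tapply orders its result by the factor's levels).
beta <- tapply(iris$Sepal.Length, iris$Species, mean)

fitted <- X %*% beta

# Each fitted value is the mean sepal length for that row's species:
fitted[c(1, 78, 143), ]
```

Because each row of X has a single 1, the matrix product simply selects the one entry of beta belonging to that row's species.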
(The R formula language is quite a rabbit hole: it can do useful, complicated stuff when the formula contains factors with different contrasts, interactions, or functions such as poly() or splines::ns() that create multi-column predictors from a single input variable...)
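As a small taste of that rabbit hole, here is a sketch of a single numeric input expanding to several columns via poly():

```r
# One numeric variable, Sepal.Length, expands into multiple model-matrix
# columns: an intercept plus two orthogonal polynomial terms.
data(iris)
Xp <- model.matrix(~ poly(Sepal.Length, 2), iris)
dim(Xp)       # 150 rows, 3 columns
colnames(Xp)
```

splines::ns() behaves analogously, producing one column per basis function.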
Question 2: in <response> ~ <stuff>, the <stuff> contains the input variables we need to define the model matrix. Until we actually fit the regression model, we don't need to know the response variable, so we can use a one-sided formula ~ <stuff>.
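You can verify that the model matrix depends only on the right-hand side: lm() builds the same matrix internally when it fits the model (a minimal sketch; X1, X2, and fit are just illustrative names):

```r
# A one-sided formula is enough to build the design matrix...
data(iris)
X1 <- model.matrix(~ Species + 0, iris)

# ...and it matches the matrix lm() constructs from the two-sided formula.
fit <- lm(Sepal.Length ~ Species + 0, iris)
X2 <- model.matrix(fit)

identical(colnames(X1), colnames(X2))
all.equal(X1, X2, check.attributes = FALSE)
```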
Your last question: using ~ outside of a formula context tells R to keep whatever follows the tilde as an unevaluated expression. For example:
> x <- ~hello
> x
~hello
> x[[2]]
hello
> hello <- 5
> eval(x[[2]])
[1] 5