
R model.matrix and the tilde operator usage?

I'm messing around with R and came across some code using the tilde operator and model.matrix in a way I can't quite figure out. I produced a very simple example below.

rm(list=ls())
data("iris")

x = model.matrix(~Species +0, iris)
x

Here's a quick snapshot of the data, selecting 3 rows with different species:

> iris[c(1, 78, 143), ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
78           6.7         3.0          5.0         1.7 versicolor
143          5.8         2.7          5.1         1.9  virginica

And here are the same rows after they've been put through model.matrix:

> x[c(1, 78, 143), ]
    Speciessetosa Speciesversicolor Speciesvirginica
1               1                 0                0
78              0                 1                0
143             0                 0                1

As you can see, this creates a matrix of 3 columns and 150 rows (the same number of rows as there are observations). In each row, the column for the corresponding species of flower is set to 1. This is really neat, but I can't really understand how or why this works.

I'm confused on two elements.

  1. What exactly is model.matrix doing here? When I bring up the help page in RStudio, it simply says: model.matrix creates a design (or model) matrix, e.g., by expanding factors to a set of dummy variables (depending on the contrasts) and expanding interactions similarly.

I don't really quite follow what this means. What does it mean when it says "expanding factors" to a set of dummy variables?

  2. Also, how is the tilde operator being used here? Usually when I have seen the tilde operator, it separates the Y and the Xs, as in lm(Y ~ X1+X2-X3) or the like. I've never seen it used nakedly like this.

If I just run something similar in the console without wrapping it in model.matrix, like so:

> ~iris$Species

I only get this as the output:

~iris$Species

This honestly looks like an error, but since R doesn't report one, I assume it isn't.

There's a lot here, so let's go point by point. The main idea is that model.matrix() is designed to take the variables in your data set and transform them into a matrix format that is suitable for linear regression (you may be familiar with the linear-algebra expression for regression, y = X %*% beta ).

For simple numerical covariates, the translation is simple: the variable becomes a column in X . Categorical variables (factors in R), however, have to be transformed to a set of binary variables that will represent differences in expected responses across categories, in a way that is defined by the contrasts associated with the factors.
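To make the difference between numeric covariates and factors concrete, here is a small sketch using the same iris data. The object names ( X_num , X_trt , X_sum ) are just illustrative, and passing the contrasts explicitly through contrasts.arg is my choice so the result does not depend on your session's options("contrasts").

```r
# Numeric covariate: copied into X as one column, next to the intercept.
X_num <- model.matrix(~ Sepal.Length, data = iris)
colnames(X_num)

# Factor with treatment contrasts (R's usual default for unordered factors):
# the first level, setosa, is the baseline; the other levels get one
# 0/1 column each.
X_trt <- model.matrix(~ Species, data = iris,
                      contrasts.arg = list(Species = "contr.treatment"))
X_trt[c(1, 78, 143), ]

# Same factor with sum-to-zero contrasts: the columns now hold 1, 0 or -1,
# so the "dummy variables" really do depend on the contrasts.
X_sum <- model.matrix(~ Species, data = iris,
                      contrasts.arg = list(Species = "contr.sum"))
X_sum[c(1, 78, 143), ]
```

In both factor examples the three species are encoded in only two columns, because the intercept column already accounts for one degree of freedom.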

~Species +0 says to set up dummy variables for the Species factor; the +0 says not to include an intercept. In this special case, the dummy variables created are indicator variables: for a given row, the value in the column corresponding to the species for that observation is 1, and the others are 0. Since (for example) the first observation is of species setosa, Speciessetosa is 1 and the other columns are zero. If you had a vector of coefficients ( beta ) that contained the mean for each species, the product X %*% beta would pick out the mean corresponding to the species for each observation.
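Here is a quick sketch of that last point. Using the per-species mean of Sepal.Length as beta is my own choice for illustration; any vector with one value per species would behave the same way.

```r
# Indicator (one-hot) design matrix, no intercept.
X <- model.matrix(~ Species + 0, data = iris)

# Illustrative coefficient vector: mean sepal length of each species,
# in the same order as the columns of X (the factor's level order).
beta <- as.numeric(tapply(iris$Sepal.Length, iris$Species, mean))

# Each row of X has a single 1, so X %*% beta selects that species' mean.
mu <- X %*% beta
cbind(iris[c(1, 78, 143), "Species", drop = FALSE],
      mean_sepal_length = mu[c(1, 78, 143)])
```

These are also the coefficients you would get from lm(Sepal.Length ~ Species + 0, data = iris), which is one reason the no-intercept form is handy for reading off group means.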

(The R formula language is quite a rabbit hole: it can do useful, complicated stuff if the formula contains factors with different contrasts; interactions; or functions such as poly() or splines::ns() that create multi-variable predictors from a single input variable...)
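As a small taste of that rabbit hole, the sketch below only looks at the column names that model.matrix() produces for an interaction and for a polynomial term. The choice of Sepal.Width as the numeric variable is arbitrary.

```r
# Interaction between a factor and a numeric variable:
# main effects plus one slope-adjustment column per non-baseline species.
X_int <- model.matrix(~ Species * Sepal.Width, data = iris)
colnames(X_int)

# poly() turns one input variable into several columns
# (here an orthogonal polynomial basis of degree 2).
X_poly <- model.matrix(~ poly(Sepal.Width, 2), data = iris)
colnames(X_poly)
```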

Question 2: in <response> ~ <stuff> , the <stuff> contains the input variables we need to define the model matrix. Until we actually fit the regression model, we don't need to know the response variable, so we can use a one-sided formula ~ <stuff> .
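You can check that the response plays no part in building the matrix with a sketch like this one. Sepal.Length is an arbitrary choice of response, and the object names are only for illustration.

```r
# One-sided formula: no response at all.
X_one <- model.matrix(~ Species, data = iris)

# Two-sided formula: the response is present but not used for X.
X_two <- model.matrix(Sepal.Length ~ Species, data = iris)

identical(colnames(X_one), colnames(X_two))
all(X_one == X_two)

# The response only comes into play when the model is actually fitted;
# the fitted model carries the same design matrix.
fit <- lm(Sepal.Length ~ Species, data = iris)
all(model.matrix(fit) == X_one)
```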

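Your last question: ~ always builds a formula object, even outside a modeling function, and R keeps whatever follows the tilde as an unevaluated expression. That is why the console simply echoes ~iris$Species back: it is printing the formula, not reporting an error. For example: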

> x <- ~hello
> x
~hello
> x[[2]]
hello
> hello <- 5
> eval(x[[2]])
[1] 5
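Applied to the expression from the question, the same idea looks roughly like this (the names f and g are just placeholders):

```r
f <- ~ iris$Species

class(f)    # a formula object, not an error
length(f)   # two parts: the `~` itself and the right-hand side

# The right-hand side is stored as an unevaluated call ...
rhs <- f[[2]]
is.call(rhs)

# ... and only produces the actual factor once it is evaluated.
levels(eval(rhs))

# A two-sided formula has one more part: the response.
g <- Sepal.Length ~ Species
length(g)
```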
