R model.matrix and the tilde operator usage?

I'm experimenting with R and came across some code that uses the tilde operator and model.matrix in a way I can't quite figure out. I produced a very simple example below.

rm(list=ls())
data("iris")

x = model.matrix(~Species +0, iris)
x

Here's a quick snapshot of the data, selecting 3 random rows with different species:

> iris[c(1, 78, 143), ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
78           6.7         3.0          5.0         1.7 versicolor
143          5.8         2.7          5.1         1.9  virginica

and the same rows after they have been put through model.matrix:

> x[c(1,78,143),]
    Speciessetosa Speciesversicolor Speciesvirginica
1               1                 0                0
78              0                 1                0
143             0                 0                1

As you can see, this output is a matrix with 3 columns and 150 rows (the same as the number of observations). In each row, the column for the corresponding species of flower is set to 1. This is really neat, but I can't really understand how or why it works.

I'm confused about two things.

  1. What exactly is model.matrix doing here? When I bring up the help page in RStudio it simply says: "model.matrix creates a design (or model) matrix, e.g., by expanding factors to a set of dummy variables (depending on the contrasts) and expanding interactions similarly."

I don't quite follow what this means. What does it mean to "expand factors" to a set of dummy variables?

  2. Also, how is the tilde operator being used here? Usually when I have seen the tilde operator, it separates the Y and the Xs, as in lm(Y ~ X1+X2-X3) or the like. I've never seen it used on its own like this.

If I just run something similar in the console without the model.matrix wrapper, like so:

> ~iris$Species

I only get this as the output:

~iris$Species

Which honestly looks like an error, but since R doesn't report one, I assume it isn't.

There's a lot here, so let's go point by point. The main idea is that model.matrix() is designed to take the variables in your data set and transform them into a matrix format that is suitable for linear regression (you may be familiar with the linear-algebra expression for regression, y = X %*% beta).

For simple numerical covariates, the translation is simple: the variable becomes a column in X. Categorical variables (factors in R), however, have to be transformed into a set of binary variables that represent differences in expected responses across categories, in a way that is defined by the contrasts associated with the factors.
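As a rough illustration of that translation, the sketch below builds a model matrix from one numeric covariate and one factor. It assumes the built-in iris data and R's default contrast setting (treatment contrasts for unordered factors); the choice of Sepal.Length as the numeric covariate is only for illustration.

```r
# Uses the built-in iris data set; no extra packages are needed.
data("iris")

# One numeric covariate and one factor, with the default intercept.
X <- model.matrix(~ Sepal.Length + Species, data = iris)

# The factor has been expanded into dummy columns; with the default
# treatment contrasts the first level (setosa) is the baseline and
# gets no column of its own.
colnames(X)

# The numeric covariate is carried across unchanged as a single column.
all(X[, "Sepal.Length"] == iris$Sepal.Length)

# The contrast scheme that was applied to each factor.
attr(X, "contrasts")
```

With a different contrast setting the factor columns would be coded differently, which is what "depending on the contrasts" in the help text refers to.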

~Species +0 says to set up dummy variables for the Species factor; the +0 says not to use an intercept. In this special case, the dummy variables created are indicator variables: for a given row, the value in the column corresponding to the species for that observation is 1, and the others are 0. Since (for example) the first observation is of "setosa", Speciessetosa is 1 and the other columns are zero. If you had a vector of coefficients (beta) that contained the mean for each species, the product X %*% beta would pick out the mean corresponding to the species for each observation.
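The "picking out the mean" step can be sketched as follows. Using Sepal.Length as the response is an assumption made only for this example, and the name mu is arbitrary.

```r
data("iris")

# Indicator (one column per species) model matrix, no intercept.
X <- model.matrix(~ Species + 0, data = iris)

# Coefficient vector holding the mean Sepal.Length of each species,
# in the same order as the factor levels (and hence the columns of X).
beta <- sapply(split(iris$Sepal.Length, iris$Species), mean)

# Each row of X has a single 1, so the product selects that row's
# species mean. The result is a 150 x 1 matrix.
mu <- X %*% beta

# The three rows shown in the question.
mu[c(1, 78, 143), ]
```

The same three values are what lm(Sepal.Length ~ Species + 0, iris) would report as its coefficients.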

(The R formula language is quite a rabbit hole: it can do useful, complicated stuff if the formula contains factors with different contrasts; interactions; or functions such as poly() or splines::ns() that create multi-variable predictors from a single input variable...)
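Two small examples of that rabbit hole, again on iris with default contrasts. The variables chosen here are arbitrary and only meant to show how the number of columns grows.

```r
data("iris")

# An interaction between a numeric covariate and a factor:
# Sepal.Width * Species expands to both main effects plus
# one Sepal.Width:Species column per non-baseline level.
X_int <- model.matrix(~ Sepal.Width * Species, data = iris)
colnames(X_int)

# poly() turns one input variable into several columns
# (here: linear and quadratic terms, plus the intercept).
X_poly <- model.matrix(~ poly(Sepal.Width, 2), data = iris)
dim(X_poly)
```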

Question 2: in <response> ~ <stuff>, the <stuff> contains the input variables we need to define the model matrix. Until we actually fit the regression model, we don't need to know the response variable, so we can use a one-sided formula ~ <stuff>.
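A quick way to check this is to build the model matrix once from a one-sided formula and once from a two-sided one. Sepal.Length as the response is just an illustrative choice.

```r
data("iris")

# Same right-hand side, with and without a response.
X_one <- model.matrix(~ Species, data = iris)
X_two <- model.matrix(Sepal.Length ~ Species, data = iris)

# The response plays no part in the model matrix.
all(X_one == X_two)
identical(colnames(X_one), colnames(X_two))

# A one-sided formula simply has one component fewer than a two-sided one.
length(~ Species)
length(Sepal.Length ~ Species)
```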

Your last question: typing ~ on its own at the console still just builds a formula object, which keeps whatever follows the tilde as an unevaluated expression. Nothing is computed, which is why R echoes the formula back instead of raising an error. For example:

> x <- ~hello
> x
~hello
> x[[2]]
hello
> hello <- 5
> eval(x[[2]])
[1] 5
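The same inspection can be applied to the ~iris$Species line from the question. This is only a sketch of how one might poke at the object; the name f is arbitrary.

```r
# Store the formula instead of printing it.
f <- ~ iris$Species

# It is an ordinary formula object, not an error value.
class(f)

# Component 1 is the tilde itself; component 2 is the
# unevaluated call iris$Species.
f[[1]]
f[[2]]
is.call(f[[2]])

# Only when the stored expression is evaluated is the column looked up.
head(eval(f[[2]]))
```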
