
model.matrix(): why do I lose control of contrasts in this case?

Suppose we have a toy data frame:

x <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
                x2 = gl(3, 2, labels = LETTERS[1:3]))

I would like to construct the following model matrix:

#    x1b x1c x2B x2C
# 1    0   0   0   0
# 2    0   0   0   0
# 3    1   0   1   0
# 4    1   0   1   0
# 5    0   1   0   1
# 6    0   1   0   1

by calling:

model.matrix(~ x1 + x2 - 1, data = x,
             contrasts.arg = list(x1 = contr.treatment(letters[1:3]),
                                  x2 = contr.treatment(LETTERS[1:3])))

but what I actually get is:

#   x1a x1b x1c x2B x2C
# 1   1   0   0   0   0
# 2   1   0   0   0   0
# 3   0   1   0   1   0
# 4   0   1   0   1   0
# 5   0   0   1   0   1
# 6   0   0   1   0   1
# attr(,"assign")
# [1] 1 1 1 2 2
# attr(,"contrasts")
# attr(,"contrasts")$x1
#   b c
# a 0 0
# b 1 0
# c 0 1

# attr(,"contrasts")$x2
#   B C
# A 0 0
# B 1 0
# C 0 1

I am somewhat confused here:

  • I have passed in explicit contrast matrices that drop the first level of each factor;
  • I have asked for the intercept to be dropped.

Then why am I getting a model matrix with 5 columns? How can I get the model matrix I want?

Whenever we lose control of something at the R level, there must be some default, unchangeable behaviour at the C level. The C code for model.matrix.default() can be found in the R source package at:

R-<release_number>/src/library/stats/src/model.c

We can find the explanation here:

/* If there is no intercept we look through the factor pattern */
/* matrix and adjust the code for the first factor found so that */
/* it will be coded by dummy variables rather than contrasts. */
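
The phrase "the first factor found" can be checked from R by swapping the order of the terms. The sketch below reuses the toy data frame from the question under the illustrative name `d` (the names `d`, `mm12` and `mm21` are not from the original post) and relies on R's default treatment contrasts:

```r
# Toy data from the question, stored under an illustrative name
d <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
                x2 = gl(3, 2, labels = LETTERS[1:3]))

# No intercept: whichever factor comes first gets full dummy coding,
# the other one is still coded by contrasts
mm12 <- model.matrix(~ x1 + x2 - 1, data = d)
mm21 <- model.matrix(~ x2 + x1 - 1, data = d)

colnames(mm12)  # x1 keeps all three levels, x2 loses its first level
colnames(mm21)  # now x2 keeps all three levels, x1 loses its first level
```

In both cases the matrix has 5 columns; only the factor that receives the full set of dummy variables changes.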

Let's make a small test of this, with the data frame

x <- data.frame(x1 = gl(2, 2, labels = letters[1:2]), x2 = sin(1:4))

  1. If we have only x2 on the RHS, we can drop the intercept successfully:

     model.matrix(~ x2 - 1, data = x)
     #           x2
     # 1  0.8414710
     # 2  0.9092974
     # 3  0.1411200
     # 4 -0.7568025

  2. If we have only x1 on the RHS, the contrast is not applied:

     model.matrix(~ x1 - 1, data = x)
     #   x1a x1b
     # 1   1   0
     # 2   1   0
     # 3   0   1
     # 4   0   1

  3. When we have both x1 and x2, the contrast is still not applied to x1:

     model.matrix(~ x1 + x2 - 1, data = x)
     #   x1a x1b         x2
     # 1   1   0  0.8414710
     # 2   1   0  0.9092974
     # 3   0   1  0.1411200
     # 4   0   1 -0.7568025

This implies that, for some response y, while there is a real difference between:

lm(y ~ x2, data = x)
lm(y ~ x2 - 1, data = x)

there is no difference in the fitted model (only in how the coefficients are parameterized) between

lm(y ~ x1, data = x)
lm(y ~ x1 - 1, data = x)

or between

lm(y ~ x1 + x2, data = x)
lm(y ~ x1 + x2 - 1, data = x)
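
A quick way to check this equivalence is to compare fitted values. The sketch below assumes a simulated response `y` (the original post does not define one), and the names `d`, `fit_int` and `fit_noint` are illustrative:

```r
set.seed(1)  # simulated response, for illustration only
d <- data.frame(x1 = gl(2, 2, labels = letters[1:2]), x2 = sin(1:4))
d$y <- rnorm(4)

fit_int   <- lm(y ~ x1, data = d)      # intercept + contrast for x1
fit_noint <- lm(y ~ x1 - 1, data = d)  # full dummy coding for x1

# Same column space, hence the same fitted values
all.equal(fitted(fit_int), fitted(fit_noint))

# Only the parameterization differs
coef(fit_int)    # (Intercept), x1b
coef(fit_noint)  # x1a, x1b
```

Here the x1a coefficient of the second fit plays the role of the intercept in the first, and x1b in the second fit equals intercept plus x1b in the first.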

The reason for such behaviour is not numerical stability, but to keep estimation and prediction sensible. If we really dropped the intercept while applying the contrast to x1, we would end up with the model matrix:

    #  x1b
    #1   0
    #2   0
    #3   1
    #4   1

The effect is that the fitted value for level a is constrained to 0.
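
The constraint can be seen by fitting on that single-column matrix directly. This is a minimal sketch with a simulated response (an assumption; the names `d`, `X`, `fit` and `pred` are illustrative), using lm.fit() so that no intercept is added behind the scenes:

```r
set.seed(1)  # simulated response, for illustration only
d <- data.frame(x1 = gl(2, 2, labels = letters[1:2]))
d$y <- rnorm(4, mean = 5)

# Build the matrix with an intercept, then keep only the contrast column
X <- model.matrix(~ x1, data = d)[, "x1b", drop = FALSE]

fit  <- lm.fit(X, d$y)                 # least squares on the x1b column alone
pred <- drop(X %*% fit$coefficients)   # predictions from that model

pred                      # rows with x1 == "a" are predicted as 0
mean(d$y[d$x1 == "a"])    # although the observed mean for level a is far from 0
```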

In the post "How can I force dropping intercept or equivalent in this linear model?", we have a dataset:

#           Y    X1    X2
#1  1.8376852  TRUE  TRUE
#2 -2.1173739  TRUE FALSE
#3  1.3054450 FALSE  TRUE
#4 -0.3476706  TRUE FALSE
#5  1.3219099 FALSE  TRUE
#6  0.6781750 FALSE  TRUE

The combination (X1 = FALSE, X2 = FALSE) never occurs in this dataset. But in a broad sense, model.matrix() has to do something safe and sensible. It would be biased to assume that, because a combination of two factor levels is absent from the training dataset, it never needs to be predicted. If we really dropped the intercept while applying contrasts, the prediction for that combination would be constrained to 0. However, the OP of that post deliberately wants such non-standard behaviour (for some reason), in which case a possible workaround was given in my answer there.
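
If the 4-column matrix from the question is really what is wanted, one common approach is to build the matrix with the intercept, so that contrasts are applied to every factor, and then drop the intercept column by hand. This is only a sketch, relying on R's default treatment contrasts; it is not claimed to be the workaround from the linked answer, and the names `toy` and `mm` are illustrative:

```r
toy <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
                  x2 = gl(3, 2, labels = LETTERS[1:3]))

# With an intercept, both factors are coded by treatment contrasts
mm <- model.matrix(~ x1 + x2, data = toy)

# Remove the intercept column manually
mm <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]
mm
```

Note that subsetting discards the "assign" and "contrasts" attributes, and that a model fitted on this matrix carries exactly the constraint described above: the baseline combination (x1 = a, x2 = A) is predicted as 0.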
