Inconsistent R-squared values in lm()

I am fitting the same linear model in two different ways, which gives the same parameter estimates but different R-squared values. Where does the difference come from? Is this a bug in R? Here is my code:

m1 <- lm(stack.loss ~ ., data = stackloss)
summary(m1)

X <- model.matrix(m1)
y <- stackloss$stack.loss
m2 <- lm(y ~ 0 + X)
summary(m2)

The output for m1 is the following (slightly shortened):

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983 
F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

The output for m2 has the same coefficient estimates and residual standard error, but different R-squared values and a different F-statistic:

             Estimate Std. Error t value Pr(>|t|)    
X(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
XAir.Flow      0.7156     0.1349   5.307  5.8e-05 ***
XWater.Temp    1.2953     0.3680   3.520  0.00263 ** 
XAcid.Conc.   -0.1521     0.1563  -0.973  0.34405    

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.979, Adjusted R-squared:  0.9741 
F-statistic: 198.2 on 4 and 17 DF,  p-value: 5.098e-14

Why are the R-squared values different?

This has been discussed in earlier posts. Here's a breakdown of what happens in the summary.lm() source code, which is where R-squared is computed. The relevant part:

r <- z$residuals
f <- z$fitted.values
w <- z$weights
if (is.null(w)) {
        mss <- if (attr(z$terms, "intercept"))
            sum((f - mean(f))^2) else sum(f^2)
        rss <- sum(r^2)
}
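To make the two branches concrete, here is a minimal sketch that reproduces both R-squared values by hand from the same fit. It assumes R-squared is then formed as mss / (mss + rss), as in summary.lm; the names mss_centered, mss_uncentered, r2_centered and r2_uncentered are mine, not from the source.

```r
# Refit both models from the question (stackloss ships with base R).
m1 <- lm(stack.loss ~ ., data = stackloss)
X  <- model.matrix(m1)
y  <- stackloss$stack.loss
m2 <- lm(y ~ 0 + X)

f   <- fitted(m1)            # m2 has the same fitted values, up to rounding
rss <- sum(residuals(m1)^2)  # residual sum of squares, shared by both fits

# Branch taken when the "intercept" attribute is 1: centre the fitted values.
mss_centered <- sum((f - mean(f))^2)
# Branch taken when the "intercept" attribute is 0: no centring.
mss_uncentered <- sum(f^2)

r2_centered   <- mss_centered / (mss_centered + rss)
r2_uncentered <- mss_uncentered / (mss_uncentered + rss)

c(centered = r2_centered, uncentered = r2_uncentered)
```

The centered value should match summary(m1)$r.squared and the uncentered one summary(m2)$r.squared, so the only thing that differs between the two summaries is which mss is used.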

Although your model matrix X contains an intercept column, the formula y ~ 0 + X tells R that the model has no intercept, so the "intercept" attribute of the terms is 0. Compare:

attr(m1$terms,"intercept")
[1] 1

attr(m2$terms,"intercept")
[1] 0
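The attribute is derived from the formula alone, not from the columns of the data. As a small sketch, calling terms() directly on the two formulas shows this without fitting anything (i_without and i_with are illustrative names of mine):

```r
# terms() only parses the formula; y and X do not need to exist here.
i_without <- attr(terms(y ~ 0 + X), "intercept")  # intercept suppressed by "0 +"
i_with    <- attr(terms(y ~ X), "intercept")      # default: intercept included

c(without = i_without, with = i_with)
```

So a constant column hidden inside X is invisible to this bookkeeping, which is why m2 is summarised as a no-intercept model.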

I do not advise fitting the model this way, because you can easily use the formula interface without providing the model matrix yourself. But you can see that by changing the attribute, you get summary.lm to use the correct mss (the rss is the same either way) and report the correct R-squared:

attr(m2$terms, "intercept") <- 1
summary(m2)
Call:
lm(formula = y ~ 0 + X)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2377 -1.7117 -0.4551  2.3614  5.6978 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
X(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
XAir.Flow      0.7156     0.1349   5.307  5.8e-05 ***
XWater.Temp    1.2953     0.3680   3.520  0.00263 ** 
XAcid.Conc.   -0.1521     0.1563  -0.973  0.34405    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983 
F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09
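If you would rather not modify the fitted object, the centered R-squared and the matching F-statistic can also be computed directly from m2. This is only a sketch under the assumption that X contains a constant column, so that the total sum of squares splits into model and residual parts; tss, p, rdf, r2 and fstat are names I introduce here.

```r
m1 <- lm(stack.loss ~ ., data = stackloss)
X  <- model.matrix(m1)
y  <- stackloss$stack.loss
m2 <- lm(y ~ 0 + X)          # left untouched, no attribute editing

rss <- sum(residuals(m2)^2)  # residual sum of squares
tss <- sum((y - mean(y))^2)  # total sum of squares around the mean

r2 <- 1 - rss / tss

p   <- length(coef(m2)) - 1  # number of slopes; the constant column is not counted
rdf <- df.residual(m2)
fstat <- ((tss - rss) / p) / (rss / rdf)

c(r.squared = r2, F = fstat, numdf = p, dendf = rdf)
```

These values should agree with the R-squared and F-statistic reported for m1.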

StupidWolf gave you the answer. You are estimating two differently specified regression models.

In your second specification, m2 <- lm(y ~ 0 + X), R does not estimate an intercept of its own, and the constant column X(Intercept) is treated as just another variable.

To get the same R^2, just correct the model:

m1 <- lm(stack.loss ~ ., data = stackloss)
summary(m1)

X <- model.matrix(m1)
y <- stackloss$stack.loss
m2 <- lm(y ~ X)
summary(m2)

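This gives you the same R^2, since you now regress the same model. Note that lm() adds its own intercept here, so the X(Intercept) column is redundant and its coefficient is reported as NA.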
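A short sketch of what happens to the duplicated constant column, plus a variant that drops it before fitting. The names m3 and m4 are mine; dropping the first column assumes the intercept is column 1 of the model matrix, as it is for model.matrix(m1) here.

```r
m1 <- lm(stack.loss ~ ., data = stackloss)
X  <- model.matrix(m1)
y  <- stackloss$stack.loss

# lm() adds its own intercept, so the constant column in X is aliased.
m3 <- lm(y ~ X)
coef(m3)               # the X(Intercept) entry is NA

# Variant: remove the constant column and let lm() supply the intercept.
m4 <- lm(y ~ X[, -1])
coef(m4)

c(m1 = summary(m1)$r.squared,
  m3 = summary(m3)$r.squared,
  m4 = summary(m4)$r.squared)
```

Both variants are summarised as intercept models, so they report the same R^2 as m1.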
