Why are my test data R-squared values identical despite using different training data?

I'm fitting two linear models in R, one using a 'big' dataset, the other using a 'small' dataset which is a subset of the big dataset.

When I calculate the out-of-sample R-squared from the two models, the results are identical.

Can someone please explain this result? I expected the model trained on the smaller dataset to have a lower R-squared, because it has fewer data points with which to accurately estimate the relationship between the response and the predictors.

Reproducible example below.

```r
set.seed(1)
x = rnorm(100)
set.seed(10)
y = x + rnorm(100)
dat = data.frame(x, y)
xtr_small = dat[1:5, ]   # for training model, small dataset
xtr_big = dat[1:50, ]    # for training model, big dataset
xte = dat[51:100, ]      # for out-of-sample testing

# Fit models, predict
fit_small = lm(y ~ x, xtr_small)
fit_big = lm(y ~ x, xtr_big)
pred_small = predict(fit_small, xte)
pred_big = predict(fit_big, xte)

# R-squared values are identical, predictions aren't
identical(cor(xte$y, pred_small)^2, cor(xte$y, pred_big)^2)  # TRUE
identical(pred_small, pred_big)  # FALSE
```

This is a simple linear regression, so the predictions are a linear function of the `x` values. The correlation of `y` with a linear function of `x` is the same as the correlation of `y` with `x`; the coefficients of the function don't matter.

Exceptions to this rule are slopes of zero (where the correlation is undefined, because the sd of the predictions is zero) and negative slopes, where the correlation changes sign. But you're looking at the squared correlation, so the sign doesn't matter, and it's extremely unlikely to get a fitted slope that is exactly zero.
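A quick numerical check of this invariance, as a minimal sketch: the data are simulated, and the intercept/slope pairs in `coefs` are arbitrary values picked for illustration, not fitted estimates.

```r
set.seed(42)
x <- rnorm(50)
y <- x + rnorm(50)
r_xy <- cor(y, x)

# Hypothetical (intercept, slope) pairs, including a negative slope
coefs <- list(c(a = 0, b = 1), c(a = 3, b = 0.2), c(a = -1, b = -5))

for (cf in coefs) {
  pred <- cf[["a"]] + cf[["b"]] * x
  # Magnitude is preserved; the sign follows the sign of the slope
  stopifnot(isTRUE(all.equal(cor(y, pred), sign(cf[["b"]]) * r_xy)))
  # The squared correlation is therefore unchanged
  stopifnot(isTRUE(all.equal(cor(y, pred)^2, r_xy^2)))
}

# Zero slope: the predictions are constant, so the correlation is undefined
flat_pred <- rep(2, length(x))
r_flat <- suppressWarnings(cor(y, flat_pred))  # NA ("standard deviation is zero")
```

The comparisons use `all.equal()` rather than `identical()` so that floating-point rounding in the last digits does not cause a spurious failure.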

This is to help you understand what user2554330 means.

Let $x$ and $y$ be the test data, and let the predicted line be $\hat{y} = \hat{a} + \hat{b}x$ with $\hat{b} \neq 0$. Then

\begin{equation}
\begin{split}
\textrm{cor}(y, \hat{y}) &= \frac{\textrm{cov}(y, \hat{y})}{\sqrt{\textrm{var}(y)}\sqrt{\textrm{var}(\hat{y})}}\\
&= \frac{\textrm{cov}(y, \hat{a} + \hat{b}x)}{\sqrt{\textrm{var}(y)}\sqrt{\textrm{var}(\hat{a} + \hat{b}x)}}\\
&= \frac{\hat{b}\,\textrm{cov}(y, x)}{\sqrt{\textrm{var}(y)}\sqrt{\textrm{var}(x)}\,|\hat{b}|}\\
&= \frac{\hat{b}}{|\hat{b}|}\textrm{cor}(y, x)
\end{split}
\end{equation}

As a result, $R^2 = [\textrm{cor}(y, \hat{y})]^2 = \frac{\hat{b}^2}{|\hat{b}|^2}[\textrm{cor}(y, x)]^2 = [\textrm{cor}(y, x)]^2$.

Note that the R-squared on the test data does not depend on the estimates of the intercept and slope.

This only holds for simple linear regression. As soon as your model becomes $y = a + b_1x_1 + b_2x_2$, the R-squared will depend on the estimated coefficients.
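To illustrate the two-predictor case, here is a minimal sketch on simulated data. The names `dat2`, `fit_small2`, and `fit_big2` and the data-generating model are assumptions made for this example, not part of the question. With two predictors the squared correlation depends on the ratio of the fitted slopes, so the two training sets generally give different values.

```r
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + x1 + x2 + rnorm(n)  # assumed data-generating model
dat2 <- data.frame(x1, x2, y)

train_small <- dat2[1:5, ]
train_big <- dat2[1:50, ]
test <- dat2[51:100, ]

fit_small2 <- lm(y ~ x1 + x2, train_small)
fit_big2 <- lm(y ~ x1 + x2, train_big)

# Squared correlation between held-out responses and predictions
r2_small <- cor(test$y, predict(fit_small2, test))^2
r2_big <- cor(test$y, predict(fit_big2, test))^2

c(small = r2_small, big = r2_big)  # no longer forced to be equal
```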

Anyway, as I warned elsewhere, R-squared is not always appropriate for assessing out-of-sample prediction. You really want to compare the mean squared prediction error, i.e., `mean((pred_small - xte$y) ^ 2)` and `mean((pred_big - xte$y) ^ 2)`.
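The comparison can be sketched as follows. The block rebuilds the question's data so it runs on its own; the helper `mspe()` is a hypothetical convenience function introduced here, not something from the question or from base R.

```r
# Recreate the question's data and fits
set.seed(1)
x <- rnorm(100)
set.seed(10)
y <- x + rnorm(100)
dat <- data.frame(x, y)
xte <- dat[51:100, ]
fit_small <- lm(y ~ x, dat[1:5, ])
fit_big <- lm(y ~ x, dat[1:50, ])

# Hypothetical helper: mean squared prediction error on held-out rows
mspe <- function(fit, newdata) {
  mean((predict(fit, newdata) - newdata$y)^2)
}

mspe_small <- mspe(fit_small, xte)
mspe_big <- mspe(fit_big, xte)
c(small = mspe_small, big = mspe_big)
```

Unlike the squared correlation, this measure is sensitive to the fitted intercept and slope, so it can distinguish the two models even in simple linear regression.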
