
Why are my test data R-squared's identical despite using different training data?

I'm fitting two linear models in R: one using a 'big' dataset, the other using a 'small' dataset that is a subset of the big one.

When I calculate out-of-sample R-squared from the two models the results are identical.

Can someone please explain this result? I expected the smaller dataset to give a lower R-squared, since it has fewer data points with which to accurately estimate the relationship between the response and the predictor.

Reproducible example below.

```
set.seed(1)
x = rnorm(100)
set.seed(10)
y = x + rnorm(100)
dat = data.frame(x, y)
xtr_small = dat[1:5, ]  # for training model, small dataset
xtr_big = dat[1:50, ]   # for training model, big dataset
xte = dat[51:100, ]     # for out-of-sample testing

# Fit models, predict
fit_small = lm(y ~ x, xtr_small)
fit_big = lm(y ~ x, xtr_big)
pred_small = predict(fit_small, xte)
pred_big = predict(fit_big, xte)

# R-squared values are identical, predictions aren't
identical(cor(xte$y, pred_small)^2, cor(xte$y, pred_big)^2)  # TRUE
identical(pred_small, pred_big)  # FALSE
```

This is a simple linear regression, so the predictions are a linear function of the x values. The correlation of y with a linear function of x is the same as the correlation of y with x; the coefficients of the function don't matter.

Exceptions to this rule are a slope of zero (where the correlation doesn't exist, because the sd of the predictions is zero) and a negative slope, where the correlation changes sign. But you're looking at the squared correlation, so the sign doesn't matter, and it's extremely unlikely to get a fitted slope that is exactly zero.
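
A quick numerical illustration of that invariance (my own sketch, not part of the original answer; `xx` and `yy` are made-up data):

```
set.seed(2)
xx <- rnorm(100)
yy <- xx + rnorm(100)
cor(yy, xx)          # baseline correlation
cor(yy, 3 + 5 * xx)  # identical: a positive-slope linear transformation
cor(yy, 3 - 5 * xx)  # same magnitude, opposite sign: negative slope
```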

This is to help you understand what user2554330 means.

Let $x$ and $y$ be the test data, and let the fitted line be $\hat{y} = \hat{a} + \hat{b}x$. Then

\begin{equation}
\begin{split}
\textrm{cor}(y, \hat{y}) &= \frac{\textrm{cov}(y, \hat{y})}{\sqrt{\textrm{var}(y)}\sqrt{\textrm{var}(\hat{y})}}\\
&= \frac{\textrm{cov}(y, \hat{a} + \hat{b}x)}{\sqrt{\textrm{var}(y)}\sqrt{\textrm{var}(\hat{a} + \hat{b}x)}}\\
&= \frac{\hat{b}\,\textrm{cov}(y, x)}{\sqrt{\textrm{var}(y)}\,|\hat{b}|\sqrt{\textrm{var}(x)}}\\
&= \frac{\hat{b}}{|\hat{b}|}\textrm{cor}(y, x)
\end{split}
\end{equation}

As a result, $R^2 = [\textrm{cor}(y, \hat{y})]^2 = \frac{\hat{b}^2}{|\hat{b}|^2}[\textrm{cor}(y, x)]^2 = [\textrm{cor}(y, x)]^2$, provided $\hat{b} \neq 0$.

Note that the R-squared on the test data is independent of the estimated intercept and slope.
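
To see this numerically (assuming the question's reproducible example has been run, so `xte`, `pred_small`, and `pred_big` exist):

```
cor(xte$y, xte$x)^2       # squared correlation of y with x on the test set
cor(xte$y, pred_small)^2  # equal to the value above
cor(xte$y, pred_big)^2    # also equal
```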

This only holds for simple linear regression. As soon as your model becomes $y = a + b_1x_1 + b_2x_2$, the R-squared will depend on the estimated coefficients, as the sketch below shows.
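
A sketch of why (my own example, with made-up variables `x1`, `x2`, `y2`): with two predictors, the predictions are no longer a linear function of a single variable, so the two training sets generally give different test-set R-squared values.

```
set.seed(3)
x1 <- rnorm(100)
x2 <- rnorm(100)
y2 <- x1 + x2 + rnorm(100)
dat2 <- data.frame(x1, x2, y2)
te2 <- dat2[51:100, ]
fit2_small <- lm(y2 ~ x1 + x2, dat2[1:5, ])   # small training set
fit2_big   <- lm(y2 ~ x1 + x2, dat2[1:50, ])  # big training set
cor(te2$y2, predict(fit2_small, te2))^2  # these two values...
cor(te2$y2, predict(fit2_big, te2))^2    # ...will generally differ
```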

Anyway, as I warned elsewhere, R-squared is not always appropriate for assessing out-of-sample prediction. You really want to compare the mean squared prediction error, i.e., `mean((pred_small - xte$y)^2)` and `mean((pred_big - xte$y)^2)`.
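
For example, again reusing the objects from the question's code:

```
mse_small <- mean((pred_small - xte$y)^2)
mse_big   <- mean((pred_big - xte$y)^2)
c(small = mse_small, big = mse_big)  # the big-sample model will usually (not always) have the lower error
```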
