
R-squared on test data

I fit a linear regression model on 75% of my data set, which includes ~11000 observations and 143 variables:

gl.fit <- lm(y[1:ceiling(length(y)*(3/4))] ~ ., data= x[1:ceiling(length(y)*(3/4)),]) #3/4 for training

and I got an R^2 of 0.43. I then tried predicting on my test data using the rest of the data:

ytest=y[(ceiling(length(y)*(3/4))+1):length(y)]
x.test <- cbind(1,x[(ceiling(length(y)*(3/4))+1):length(y),]) #The rest for test
yhat <- as.matrix(x.test)%*%gl.fit$coefficients  #Calculate the predicted values

I now would like to calculate the R^2 value on my test data. Is there an easy way to calculate that?

Thank you

There are a couple of problems here. First, this is not a good way to use lm(...). lm(...) is meant to be used with a data frame, with the formula expressions referencing columns in the df. So, assuming your data is in two vectors x and y,

set.seed(1)    # for reproducible example
x <- 1:11000
y <- 3+0.1*x + rnorm(11000,sd=1000)

df <- data.frame(x,y)
# training set
train <- sample(1:nrow(df),0.75*nrow(df))   # random sample of 75% of data

fit <- lm(y~x,data=df[train,])

Now fit has the model based on the training set. Using lm(...) this way allows you, for example, to generate predictions without all the matrix multiplication.

The second problem is the definition of R-squared. The conventional definition is:

1 - SS.residual/SS.total

For the training set, and the training set ONLY,

SS.total = SS.regression + SS.residual

so

SS.regression = SS.total - SS.residual,

and therefore

R.sq = SS.regression/SS.total

so R.sq is the fraction of variability in the dataset that is explained by the model, and will always be between 0 and 1.

You can see this below.

SS.total      <- with(df[train,],sum((y-mean(y))^2))
SS.residual   <- sum(residuals(fit)^2)
SS.regression <- sum((fitted(fit)-mean(df[train,]$y))^2)
SS.total - (SS.regression+SS.residual)
# [1] 1.907349e-06
SS.regression/SS.total     # fraction of variation explained by the model
# [1] 0.08965502
1-SS.residual/SS.total     # same thing, for model frame ONLY!!! 
# [1] 0.08965502          
summary(fit)$r.squared     # both are = R.squared
# [1] 0.08965502

But this does not work with the test set (e.g., when you make predictions from a model).

test <- -train
test.pred <- predict(fit,newdata=df[test,])
test.y    <- df[test,]$y

SS.total      <- sum((test.y - mean(test.y))^2)
SS.residual   <- sum((test.y - test.pred)^2)
SS.regression <- sum((test.pred - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
# [1] 8958890

# NOT the fraction of variability explained by the model
test.rsq <- 1 - SS.residual/SS.total  
test.rsq
# [1] 0.0924713

# fraction of variability explained by the model
SS.regression/SS.total 
# [1] 0.08956405

In this contrived example there is not much difference, but it is very possible to have an R-sq value less than 0 (when defined this way).

If, for example, the model is a very poor predictor with the test set, then the residuals can actually be larger than the total variation in the test set. This is equivalent to saying that the test set is modeled better using its mean than using the model derived from the training set.

I noticed that you use the first three quarters of your data as the training set, rather than taking a random sample (as in this example). If the dependence of y on x is non-linear, and the x's are in order, then you could get a negative R-sq with the test set.
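To illustrate that last point, here is a small sketch (not part of the answer above; the quadratic data-generating process is made up for demonstration) showing how a sequential split can produce a sharply negative test R-squared when y depends non-linearly on an ordered x:

```r
set.seed(1)
x  <- 1:1000
y  <- (x - 500)^2/1000 + rnorm(1000, sd = 10)   # non-linear in x
df <- data.frame(x, y)

train <- 1:750                       # first 3/4 of the data, NOT a random sample
fit   <- lm(y ~ x, data = df[train, ])

test.y    <- df[-train, ]$y
test.pred <- predict(fit, newdata = df[-train, ])

SS.residual <- sum((test.y - test.pred)^2)
SS.total    <- sum((test.y - mean(test.y))^2)
1 - SS.residual/SS.total             # well below 0 here
```

The linear fit extrapolates the (roughly downward) trend of the first 750 points, while the held-out tail of the parabola rises, so the residuals dwarf the test set's own variation.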

Regarding OP's comment below, one way to assess the model with a test set is by comparing in-model to out-of-model mean squared error (MSE).

mse.train <- summary(fit)$sigma^2
mse.test  <- sum((test.pred - test.y)^2)/(nrow(df)-length(train)-2)

If we assume that the training and test sets are both normally distributed with the same variance, with means following the same model formula, then the ratio should have an F-distribution with (n.train - 2) and (n.test - 2) degrees of freedom. If the MSEs are significantly different based on an F-test, then the model does not fit the test data well.
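As a self-contained sketch of that F-test (rebuilding the earlier simulated example so the snippet runs on its own; the ratio direction and degrees of freedom here are my reading of the comparison, not spelled out in the answer):

```r
set.seed(1)
df <- data.frame(x = 1:11000)
df$y <- 3 + 0.1*df$x + rnorm(11000, sd = 1000)

train <- sample(1:nrow(df), 0.75*nrow(df))
fit   <- lm(y ~ x, data = df[train, ])

test.pred <- predict(fit, newdata = df[-train, ])
test.y    <- df[-train, ]$y

mse.train <- summary(fit)$sigma^2
mse.test  <- sum((test.pred - test.y)^2)/(nrow(df) - length(train) - 2)

# F statistic and a two-sided p-value for H0: the two MSEs are equal
F.ratio <- mse.test/mse.train
df1 <- nrow(df) - length(train) - 2   # n.test - 2
df2 <- length(train) - 2              # n.train - 2
p.value <- 2 * min(pf(F.ratio, df1, df2),
                   pf(F.ratio, df1, df2, lower.tail = FALSE))
```

Since both halves come from the same population here, F.ratio should sit near 1 and the test should not reject.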

Have you plotted your test.y and pred.y vs x? That alone will tell you a lot.

Calculating R-squared on the testing data is a little tricky, as you have to remember what your baseline is. Your baseline projection is the mean of your training data.

Therefore, extending the example provided by @jlhoward above:

SS.test.total      <- sum((test.y - mean(df[train,]$y))^2)
SS.test.residual   <- sum((test.y - test.pred)^2)
SS.test.regression <- sum((test.pred - mean(df[train,]$y))^2)
SS.test.total - (SS.test.regression+SS.test.residual)
# [1] 11617720 not 8958890

test.rsq <- 1 - SS.test.residual/SS.test.total  
test.rsq
# [1] 0.09284556 not 0.0924713

# fraction of variability explained by the model
SS.test.regression/SS.test.total 
# [1] 0.08907705 not 0.08956405

Update: the miscTools::rSquared() function assumes that R-squared is calculated on the same dataset on which the model was trained, since it computes

yy <- y - mean(y)

behind the scenes, on line 184 here: https://github.com/cran/miscTools/blob/master/R/utils.R

If you want a function, the miscTools package has an rSquared function.

require(miscTools)
r2 <- rSquared(ytest, resid = ytest-yhat)
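For example, rSquared() reproduces the 1 - SS.residual/SS.total formula on any pair of observed and predicted vectors. A small sketch with made-up data (the names y.obs and y.pred are mine, not from the question):

```r
library(miscTools)

set.seed(1)
y.obs  <- rnorm(100)
y.pred <- y.obs + rnorm(100, sd = 0.5)   # imperfect predictions

r2.pkg    <- rSquared(y.obs, resid = y.obs - y.pred)
r2.manual <- 1 - sum((y.obs - y.pred)^2)/sum((y.obs - mean(y.obs))^2)
all.equal(as.numeric(r2.pkg), r2.manual)   # TRUE
```

Note that because of the yy <- y - mean(y) line above, the baseline is the mean of whatever y vector you pass in, so on test data this centers on the test mean, not the training mean.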

When you use an R2 measure on an (out-of-)sample, you lose certain aspects of the interpretation of R2:

  • the equivalence SSR total = SSR explained + SSR error
  • the fact that R2 is equal to the square of the correlation between y and predicted y
  • the fact that R2 is in [0,1]

If you want to use R, I would recommend the function modelr::rsquare. Note this uses the SSR total from the test sample, not the training sample (as some people seem to advocate).

Here I take an example where our train data has only 3 points; there is hence a high risk that we have a bad model, and hence poor out-of-sample performance. Indeed, you can see that the R2 is negative!

library(modelr)

train <- mtcars[c(1,3,4),]
test  <- mtcars[-c(1,3,4),]

mod <- lm(carb ~ drat, data = train)

Compute on train data:

## train
y_train <- train$carb
SSR_y_train <- sum((y_train-mean(y_train))^2)

cor(fitted(mod), y_train)^2
#> [1] 0.2985092
rsquare(mod, train)
#> [1] 0.2985092
1-sum(residuals(mod)^2)/SSR_y_train
#> [1] 0.2985092

Compute on test data:

## test
pred_test <- predict(mod, newdata = test)
y_test <- test$carb
SSR_y_test <- sum((y_test-mean(y_test))^2)

cor(pred_test, y_test)^2
#> [1] 0.01737236
rsquare(mod, test)
#> [1] -0.6769549

1 - 28*var(pred_test - y_test)/SSR_y_train
#> [1] -19.31621
1 - 28*var(pred_test - y_test)/SSR_y_test
#> [1] -0.6769549
