Negative R-squared for a test dataset in linear regression?

I am simulating a linear regression on artificial data and then computing the RSE and R-squared by hand. I do this on an in-sample dataset, on which the model is trained, and then test the model on an out-of-sample dataset. The out-of-sample and in-sample data are drawn from the same normal distribution but with different seeds. However, my numbers for the out-of-sample dataset do not make sense. Could you please help me locate the bug?

library(data.table)  # data.table()
library(ggplot2)     # ggplot(), geom_*()

set.seed(1)
z1 <- rnorm(100)
z2 <- z1 ^ 2
error <- rnorm(100, sd = 0.25)
y1 <- 1 + 2 * z1 + error
data1 <- data.table(y1, z1, z2)
model_quad <- lm(y1 ~ z1 + z2, data1)
model_lin <- lm(y1 ~ z1, data1)

confint(model_lin)
confint(model_quad)

summary(model_lin)
summary(model_quad)

ggplot(data1) +
  geom_point(aes(x = z1, y = y1), color = "blue", size = 3) +
  geom_point(aes(x = z2, y = y1), color = "red", size = 3) +
  geom_line(stat = "smooth", method = lm, aes(x = z1, y = y1), color = "blue", size = 2, alpha = 0.5) +
  geom_line(stat = "smooth", method = lm, aes(x = z2, y = y1), color = "red", size = 2, alpha = 0.5) +
  geom_ribbon(stat = "smooth", method = lm, aes(x = z1, y = y1), fill = "blue", alpha = 0.1) +
  geom_ribbon(stat = "smooth", method = lm, aes(x = z2, y = y1), fill = "red", alpha = 0.1) 

set.seed(100)
z12 <- rnorm(100)
z22 <- z12 ^ 2
error2 <- rnorm(100, sd = 0.25)
y2 <- 1 + 2 * z12 + error2
data2 <- data.table(y2, z12, z22)

summary(model_lin)
summary(model_quad)

ggplot(data2) +
  geom_point(aes(x = z12, y = y2), color = "blue", size = 3) +
  geom_point(aes(x = z22, y = y2), color = "red", size = 3) +
  geom_line(stat = "smooth", method = lm, aes(x = z12, y = y2), color = "blue", size = 2, alpha = 0.5) +
  geom_line(stat = "smooth", method = lm, aes(x = z22, y = y2), color = "red", size = 2, alpha = 0.5) +
  geom_ribbon(stat = "smooth", method = lm, aes(x = z12, y = y2), fill = "blue", alpha = 0.1) +
  geom_ribbon(stat = "smooth", method = lm, aes(x = z22, y = y2), fill = "red", alpha = 0.1) +
  geom_abline(intercept = 0.99, slope = 1.999, size = 2, color = "yellow", alpha = 0.3)


predictions_in_sample_linear <-  predict(model_lin, data1)
predictions_in_sample_quadratic <- predict(model_quad, data1)
predictions_out_of_sample_linear <-  predict(model_lin, data2)
predictions_out_of_sample_quadratic <- predict(model_quad, data2)
TSE_in_sample <- (y1 - mean(y1)) %*% (y1 - mean(y1))
RSE_in_sample_linear <- (predictions_in_sample_linear - y1)  %*% (predictions_in_sample_linear - y1) 
RSE_in_sample_quadratic <- (predictions_in_sample_quadratic - y1)  %*% (predictions_in_sample_quadratic - y1) 
R_Square_in_sample_linear <- (TSE_in_sample - RSE_in_sample_linear) / TSE_in_sample
R_Square_in_sample_quadratic<- (TSE_in_sample - RSE_in_sample_quadratic) / TSE_in_sample
TSE_out_of_sample <- (y2 - mean(y2)) %*% (y2 - mean(y2))
RSE_out_of_sample_linear <- (predictions_out_of_sample_linear - y2)  %*% (predictions_out_of_sample_linear - y2) 
RSE_out_of_sample_quadratic <- (predictions_out_of_sample_quadratic - y2)  %*% (predictions_out_of_sample_quadratic - y2) 
R_Square_out_of_sample_linear <- (TSE_out_of_sample - RSE_out_of_sample_linear) / TSE_out_of_sample
R_Square_out_of_sample_quadratic<- (TSE_out_of_sample - RSE_out_of_sample_quadratic) / TSE_out_of_sample

predictions_in_sample_linear 
predictions_in_sample_quadratic 
predictions_out_of_sample_linear 
predictions_out_of_sample_quadratic 
TSE_in_sample 
RSE_in_sample_linear 
RSE_in_sample_quadratic 
R_Square_in_sample_linear 
R_Square_in_sample_quadratic
TSE_out_of_sample 
RSE_out_of_sample_linear 
RSE_out_of_sample_quadratic 
R_Square_out_of_sample_linear 
R_Square_out_of_sample_quadratic

This code returns a negative R-squared on the out-of-sample data, which is absurd.

Your advice will be appreciated.

Long question, but short answer. You should use

data2 <- data.frame(y1 = y2, z1 = z12, z2 = z22)

This gives

RSE_out_of_sample_linear
# 0.9902969

RSE_out_of_sample_quadratic
# 0.989241
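
Why the rename matters, as a minimal sketch: `predict()` matches the columns of `newdata` to the model's predictors by name. If `newdata` has no column called `z1`, R falls back to the environment of the model formula, finds the training vector `z1` there, and (because both vectors have 100 elements) silently returns the in-sample fitted values. The example below reproduces that with the linear model only and base `data.frame`s; the helper `r_squared` and the names `bad_newdata` and `good_newdata` are mine, not from the question.

```r
# Training data, same seed and draw order as in the question
set.seed(1)
z1 <- rnorm(100)
y1 <- 1 + 2 * z1 + rnorm(100, sd = 0.25)
model_lin <- lm(y1 ~ z1)

# Test data, same seed and draw order as in the question
set.seed(100)
z12 <- rnorm(100)
y2 <- 1 + 2 * z12 + rnorm(100, sd = 0.25)

# Illustrative helper (not part of the original post)
r_squared <- function(y, pred) 1 - sum((y - pred)^2) / sum((y - mean(y))^2)

# Column names differ from the training variables: z1 is not found in
# newdata, so the training z1 from the formula's environment is used instead
bad_newdata <- data.frame(y2 = y2, z12 = z12)
pred_bad <- predict(model_lin, bad_newdata)
same_as_fitted <- isTRUE(all.equal(unname(pred_bad), unname(fitted(model_lin))))

# Column named like the training predictor: predictions use the test data
good_newdata <- data.frame(z1 = z12)
pred_good <- predict(model_lin, good_newdata)

same_as_fitted             # TRUE: the "out-of-sample" predictions are in-sample ones
r_squared(y2, pred_bad)    # negative: predictions are unrelated to y2
r_squared(y2, pred_good)   # close to 1
```

Note that a negative out-of-sample R-squared is not impossible in itself: on held-out data nothing forces the residual sum of squares to stay below the total sum of squares. Here, though, it is a symptom of comparing `y2` against predictions made for the wrong inputs.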

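One way to catch this kind of mismatch early is to check the column names before predicting. The wrapper below is a hypothetical helper (`predict_strict` is not a function from base R or any package); it assumes the model was fitted with a formula, so that `terms()` works on it, and it refuses to predict when `newdata` lacks a predictor.

```r
# Hypothetical helper: fail loudly if newdata lacks a predictor the model uses
predict_strict <- function(model, newdata) {
  needed <- all.vars(delete.response(terms(model)))
  missing_cols <- setdiff(needed, names(newdata))
  if (length(missing_cols) > 0) {
    stop("newdata is missing: ", paste(missing_cols, collapse = ", "))
  }
  predict(model, newdata)
}

# Small training set kept inside a data frame
set.seed(1)
train <- data.frame(z1 = rnorm(100))
train$y1 <- 1 + 2 * train$z1 + rnorm(100, sd = 0.25)
model_lin <- lm(y1 ~ z1, data = train)

# Test set whose predictor column has the wrong name
set.seed(100)
test_bad <- data.frame(z12 = rnorm(100))
res <- tryCatch(predict_strict(model_lin, test_bad),
                error = function(e) conditionMessage(e))
res

# After renaming the column, prediction goes through
test_good <- data.frame(z1 = test_bad$z12)
pred <- predict_strict(model_lin, test_good)
```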