
Cross-validation by hand in R

Trying to teach myself cross-validation on a super simple example, linear regression. My understanding is that when I build a model via CV, it should have a lower RMSE. But I find almost identical RMSEs with and without CV. Is this because I've coded too simple a problem? Or because I've implemented CV incorrectly?

Here is the code:

library(tidyverse)

set.seed(1234)
n = 1e4
x = rnorm(n) 
y = 1 + 3*x + rnorm(n)
df = data.frame(y,x)
df$id = 1:nrow(df)
df$train = sample(c(TRUE, FALSE), nrow(df), replace=TRUE, prob=c(0.7,0.3))
df_train = subset(df, train)
df_test = subset(df, !train)
k = 10 # 10 folds
df_train$fold = sample(rep(1:k, length.out = nrow(df_train))) # length.out avoids a length mismatch when nrow(df_train) isn't a multiple of k

results = vector(mode = 'list', length = k)
for(i in 1:k){
  cv_train = df_train[df_train$fold != i,]
  cv_test = df_train[df_train$fold == i,]
  m = lm(y ~ x, data = cv_train)
  cv_test$y_resid = cv_test$y - predict(m, newdata = cv_test)
  results[[i]] = cv_test
  rm(cv_train, cv_test, m)
}
df2 = bind_rows(results)
rmse_by_fold = df2 %>% group_by(fold) %>% summarise(rmse = sqrt(mean(y_resid**2)))
# rmse
mean(rmse_by_fold$rmse)


# no cross validation
m = lm(y~x, data = df_train)
y_resid = df_train$y - predict(m, newdata = df_train)
# rmse
sqrt(mean(y_resid**2))

There is a design flaw here: you are fitting the true model. To be precise, you simulated (x, y) from a straight line and then fit a straight line to it. The fit recovers the truth essentially perfectly, so the test error comes out about the same as the training error.
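
You can check this directly with the df_train and df_test you already built (note that df_test is created but never used afterwards). With a correctly specified model, both the training RMSE and the held-out RMSE land near the noise standard deviation of 1. A quick sketch, assuming your code above has been run:

# correctly specified model: test RMSE is about the same as training RMSE, both near 1
rmse = function(obs, pred) sqrt(mean((obs - pred)^2))
m_true = lm(y ~ x, data = df_train)
rmse(df_train$y, predict(m_true, newdata = df_train)) # training RMSE
rmse(df_test$y, predict(m_true, newdata = df_test))   # held-out test RMSE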

Cross-validation helps you detect overfitting, which happens when you specify a model that is more complicated than the ground truth. For example, if the ground truth is a straight line but you train a quintic polynomial, you will observe a small training error but a high test error.
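
A minimal sketch of that effect (the tiny sample size, seed, and quintic fit below are chosen purely for illustration and are not part of the question's setup):

set.seed(42)                                  # illustrative seed
n_small = 15
x_small = rnorm(n_small)
y_small = 1 + 3*x_small + rnorm(n_small)      # ground truth is still a straight line
small_train = data.frame(x = x_small, y = y_small)
fresh = data.frame(x = runif(1e4, min(x_small), max(x_small)))
fresh$y = 1 + 3*fresh$x + rnorm(1e4)          # fresh draws from the same process
fit5 = lm(y ~ poly(x, 5), data = small_train) # deliberately over-complicated model
sqrt(mean(residuals(fit5)^2))                            # training RMSE: typically well below 1
sqrt(mean((fresh$y - predict(fit5, newdata = fresh))^2)) # test RMSE: typically well above 1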

When you have a sequence of candidate models, cross-validation can be used for model selection. Suppose the ground truth is a cubic polynomial and you have six candidate models, polynomials of degree 1 through 6. You can then compute a CV score for each degree and plot the CV score against the degree. You should see that degree = 3 gives the minimal CV error.
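
A rough sketch of that selection loop (the cubic coefficients, sample size, and fold assignment below are made up for illustration):

set.seed(2024)                                # illustrative seed
n_cv = 500
x_cv = runif(n_cv, -2, 2)
y_cv = 1 + x_cv - 2*x_cv^2 + 0.5*x_cv^3 + rnorm(n_cv)  # cubic ground truth
dat = data.frame(x = x_cv, y = y_cv, fold = sample(rep(1:10, length.out = n_cv)))

cv_rmse = sapply(1:6, function(degree) {
  fold_rmse = sapply(1:10, function(i) {
    fit = lm(y ~ poly(x, degree), data = dat[dat$fold != i, ])
    pred = predict(fit, newdata = dat[dat$fold == i, ])
    sqrt(mean((dat$y[dat$fold == i] - pred)^2))
  })
  mean(fold_rmse)
})

plot(1:6, cv_rmse, type = "b", xlab = "polynomial degree", ylab = "CV RMSE")
# degrees 1 and 2 underfit badly; the curve drops sharply and flattens out from degree 3 onwards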

My understanding is that when I build a model via CV, it should have a lower RMSE. But I find almost identical RMSEs with and without CV. Is this because I've coded too simple a problem?

You've misunderstood the point of cross-validation. Cross-validation does not improve your model; your model is not better for having done cross-validation. The technique is used to estimate out-of-sample performance.

Let's see how this can be done in R without using a function from a library. Let's begin by making a function which samples from the true data-generating process, and then create a training set of 1000 samples. We'll add a column called fold which tells us which fold each observation belongs to, and hence when that observation will be left out.

library(tidyverse)


make_data <- function(n){
  x = rnorm(n)
  y = 2*x + 1 + rnorm(n, 0, 1)
  tibble(x, y)
}

kfolds <- 10
train_n <- 1000
train_data <- make_data(train_n) %>% 
              mutate(fold = sample(1:kfolds, size = train_n, replace = TRUE))

We will now loop through each of the folds, leaving one out and training on the rest. We will fit a model on the remaining folds and make predictions on the held-out fold, recording the RMSE in a vector as we go.

cv_err = rep(0, kfolds)
for(i in 1:kfolds){
  in_data <- filter(train_data, fold!=i)
  out_data <- filter(train_data, fold==i)
  
  fit <- lm(y~x, data = in_data)
  preds <- predict(fit, newdata=out_data)
  
  err <- out_data$y - preds
  mse <- mean(err^2)
  # Record the RMSE
  cv_err[i] <- sqrt(mse)
}

Our estimate is then just

mean(cv_err)
# depending on your random seed
[1] 0.9908503

This is pretty close to the standard deviation of the noise in the true data-generating process, which is what the RMSE of a correctly specified model estimates. Because we simulated the data, we can check this by taking a big sample from the process and estimating the error directly.

test_data <- make_data(1e6)
fit <- lm(y~x, data = train_data)
preds <- predict(fit, newdata=test_data)
err <- test_data$y - preds
mse <- mean(err^2)
sqrt(mse)
# Depending on the random seed
[1] 1.001989

Note that the model did not improve after doing this procedure. Our estimate is a good one (because we have correctly specified the conditional mean).
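
By contrast, if the conditional mean were misspecified (say, an intercept-only model fit to the same train_data, a deliberately bad model used here only for illustration), cross-validation would faithfully report the much larger out-of-sample error, yet it still would not make that model any better:

cv_err_bad <- rep(0, kfolds)
for(i in 1:kfolds){
  in_data  <- filter(train_data, fold != i)
  out_data <- filter(train_data, fold == i)
  fit_bad  <- lm(y ~ 1, data = in_data)   # intercept only: ignores x entirely
  cv_err_bad[i] <- sqrt(mean((out_data$y - predict(fit_bad, newdata = out_data))^2))
}
mean(cv_err_bad)
# roughly sd(y), about 2.2 here, far above the noise sd of 1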
