
Minor issue related to glmnet and LOOCV in R

I tried to fit a regularized regression model (LASSO, ridge, elastic net) with a leave-one-out cross-validation (LOOCV) split using cv.glmnet and glmnet.

I used this DATASET. It has 517 rows and 13 columns (two of them categorical variables); the dependent variable is "area".

I would like to build a model without the categorical variables, then calculate the mean of the coefficients across the LOOCV runs, as well as the mean R-squared and root mean squared error (RMSE).

The last 10 rows of the dataset are as follows:

tail(data, 10)
    X Y month day FFMC   DMC    DC  ISI temp RH wind rain  area
508 2 4   aug fri 91.0 166.9 752.6  7.1 25.9 41  3.6  0.0  0.00
509 1 2   aug fri 91.0 166.9 752.6  7.1 25.9 41  3.6  0.0  0.00
510 5 4   aug fri 91.0 166.9 752.6  7.1 21.1 71  7.6  1.4  2.17
511 6 5   aug fri 91.0 166.9 752.6  7.1 18.2 62  5.4  0.0  0.43
512 8 6   aug sun 81.6  56.7 665.6  1.9 27.8 35  2.7  0.0  0.00
513 4 3   aug sun 81.6  56.7 665.6  1.9 27.8 32  2.7  0.0  6.44
514 2 4   aug sun 81.6  56.7 665.6  1.9 21.9 71  5.8  0.0 54.29
515 7 4   aug sun 81.6  56.7 665.6  1.9 21.2 70  6.7  0.0 11.16
516 1 4   aug sat 94.4 146.0 614.7 11.3 25.6 42  4.0  0.0  0.00
517 6 3   nov tue 79.5   3.0 106.7  1.1 11.8 31  4.5  0.0  0.00

My attempted code is as follows:

library(glmnet)    # glmnet(), cv.glmnet()
library(caret)     # RMSE(), R2()
library(magrittr)  # %>%

set.seed(123)
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv")
data <- data[-(3:4)]                     # drop the categorical columns "month" and "day"
nrFolds <- 517                           # one fold per observation (LOOCV)
folds <- rep_len(1:nrFolds, nrow(data))
for (k in 1:nrFolds) {
  fold <- which(folds == k)
  data.train <- data[-fold, ]
  data.test <- data[fold, ]
  x.train <- as.matrix(data.train[-11])  # column 11 is "area"
  y.train <- as.matrix(data.train[11])
  x.test <- as.matrix(data.test[-11])
  y.test <- as.matrix(data.test[11])
  cv <- cv.glmnet(x.train, y.train, alpha = 0)
  model <- glmnet(x.train, y.train, alpha = 0, lambda = cv$lambda.min)
  predictions <- model %>% predict(x.test) %>% as.vector()
  RMSE <- RMSE(predictions, data.test$area)
  Rsquare <- R2(predictions, data.test$area)
  LL <- list(cv, model, coef(model), predictions, RMSE, Rsquare)
}
LL

The code gives me only one prediction value, and the R-squared is NA. Also, I am not sure whether it is right to use a leave-one-out cross-validation split with cv.glmnet and glmnet this way.

Any help is appreciated.

Updated

I tried to use the caret package.

I first split the dataset manually in a LOOCV fashion, because I want every single observation to take part in the testing. Then I used the train function to implement my original idea from the question.

My code:

library(caret)     # train(), trainControl(), RMSE(), R2()
library(magrittr)  # %>%

data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv")
data <- data[-(3:4)]                     # drop "month" and "day"
lambda <- 10^seq(-3, 3, length = 100)
predictions <- rep(0, nrow(data))
set.seed(123)
for (i in 1:nrow(data)) {
  test.data <- data[i, ]
  train.data <- data[-i, ]
  ridge <- train(
    area ~ ., data = train.data, method = "glmnet",
    trControl = trainControl(method = "LOOCV"),  # number is ignored for LOOCV
    tuneGrid = expand.grid(alpha = 0, lambda = lambda))
  coefs <- coef(ridge$finalModel, ridge$bestTune$lambda)
  predictions[i] <- ridge %>% predict(test.data)
  RMSE <- RMSE(predictions, test.data$area)
  Rsquare <- R2(predictions, test.data$area)
  LL <- list(RMSE, Rsquare, coefs, predictions)
}

This code also gives me the following error:

Error in cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) : 
  incompatible dimensions
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

Further update

I also wrote the following code using the caret package (ridge, for example):

library(caret)

set.seed(123)
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv")
data <- data[-(3:4)]                     # drop "month" and "day"
lambda.grid <- 10^seq(10, -2, length = 100)
alpha.grid <- 0                          # alpha = 0 is ridge
control <- trainControl(method = "LOOCV")
srchGrd <- expand.grid(alpha = alpha.grid, lambda = lambda.grid)
lm_model <- train(area ~ ., data = data, trControl = control,
                  tuneGrid = srchGrd, method = "glmnet")
lm_model
coef(lm_model$finalModel, lm_model$bestTune$lambda)

Am I right now?

OK, now that you have stated it, the problem is quite obvious: each time through the loop you assign a new value to LL, so LL never holds all the results, just the last one. Try:

LL <- lapply(1:nrFolds, function(k) {
  fold <- which(folds == k)
  data.train <- data[-fold, ]
  data.test <- data[fold, ]
  x.train <- as.matrix(data.train[-11])
  y.train <- as.matrix(data.train[11])
  x.test <- as.matrix(data.test[-11])
  y.test <- as.matrix(data.test[11])
  cv <- cv.glmnet(x.train, y.train, alpha = 0)
  model <- glmnet(x.train, y.train, alpha = 0, lambda = cv$lambda.min)
  predictions <- model %>% predict(x.test) %>% as.vector()
  RMSE <- RMSE(predictions, data.test$area)
  Rsquare <- R2(predictions, data.test$area)
  list(CV = cv, model = model, coefs = coef(model), preds = predictions,
       rmse = RMSE, r2 = Rsquare)
})

However, please look into the caret package: it automates the creation of folds and the testing. Also, I would calculate the RMSE and R² post hoc.

EDIT: Yes, R2 is NA. This is because the code above attempts leave-one-out CV: rep_len(1:nrFolds, nrow(data)) returns the same as 1:517, so fold is a single number between 1 and 517, each data.train has 516 rows, and each data.test is a single row. Therefore, predictions is a vector of length 1, and R² (which caret computes as a squared correlation) cannot be calculated from a single point.
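
You can see why in isolation (the two numbers below are made up, purely for illustration):

# R² is the squared correlation between observed and predicted values;
# with a single pair there is no variance, so cor() returns NA with a warning.
cor(3.2, 2.9)   # NA, with a warning that the standard deviation is zero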

However, predictions are returned correctly.

preds <- sapply(LL, function(x) x$preds)

Not that they are very good, but the question was about running glmnet, not about how to make the best predictions.
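
For completeness, here is a sketch of the post-hoc calculation I mentioned: pooled RMSE and R² over all 517 held-out predictions, plus the mean of each coefficient across the runs (the averaging you asked about). It assumes the LL and preds objects from the code above.

obs <- data$area

# Pooled LOOCV performance, computed once over all held-out predictions
rmse <- sqrt(mean((preds - obs)^2))
r2   <- cor(preds, obs)^2

# Mean of each coefficient across the 517 runs (one column per fold)
coef.mat <- sapply(LL, function(x) as.vector(x$coefs))
rownames(coef.mat) <- rownames(LL[[1]]$coefs)
rowMeans(coef.mat)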

EDIT 2: regarding your updated code, here is what happens. You first initialize the vector predictions with 517 zeroes. In the first iteration of your loop, you tell caret to run LOOCV on a training set that contains 516 samples. caret then returns the optimized ridge model (plus a lot of information which you ignore, for example the RMSE for each parameter value tested). Then you make predictions on your test set, which is one sample, and enter this single value into the predictions vector, which now contains 1 prediction and 516 zeroes. Next you try to calculate R² and RMSE between predictions (a vector of one prediction and 516 zeroes) and the test response (a vector of one value), which, not unexpectedly, fails. Finally, you store it all in a list called LL, which gets overwritten the next time the loop runs.
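
If you want to keep the manual outer loop, a minimal repair looks like this (a sketch, not a recommendation: it assumes the lambda grid from your code, and I use 10-fold inner CV instead of LOOCV to keep the nested loop tractable). The point is to fill predictions inside the loop and compute the metrics once afterwards, against all 517 observed values.

predictions <- rep(NA_real_, nrow(data))
for (i in 1:nrow(data)) {
  fit <- train(area ~ ., data = data[-i, ], method = "glmnet",
               trControl = trainControl(method = "cv", number = 10),  # inner 10-fold CV
               tuneGrid = expand.grid(alpha = 0, lambda = lambda))
  predictions[i] <- predict(fit, data[i, ])
}
# Metrics computed once, over all 517 held-out predictions
RMSE(predictions, data$area)
R2(predictions, data$area)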

How I would do it: hold out 33% of the data as a validation set. Use the remaining 67% with caret to train a single model (with LOOCV or k-fold CV inside caret to optimize the parameters). Inspect the output of caret; in particular, look at the RMSE that caret reports (and R², if you use something other than LOOCV). Then test the performance of your model on the validation set.
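
Here is a sketch of that workflow; the 2/3 split, the 10-fold inner CV, and the lambda grid are illustrative choices, not the only sensible ones.

library(caret)

set.seed(123)
idx <- createDataPartition(data$area, p = 2/3, list = FALSE)  # ~67% for training
train.data <- data[idx, ]
valid.data <- data[-idx, ]

fit <- train(area ~ ., data = train.data, method = "glmnet",
             trControl = trainControl(method = "cv", number = 10),
             tuneGrid = expand.grid(alpha = 0,
                                    lambda = 10^seq(-3, 3, length = 100)))
fit         # CV RMSE and R² per candidate lambda
plot(fit)   # visualize the tuning curve

# Final performance estimate on data the model never saw
pred <- predict(fit, valid.data)
RMSE(pred, valid.data$area)
R2(pred, valid.data$area)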
