
All ROC AUC values are 1 using tidymodels

I am trying to fit a LASSO model with a binary outcome using tidymodels. I have essentially copied the case study from the tidymodels website ( https://www.tidymodels.org/start/case-study/ , the hotel stays dataset) and applied it to my own data, but for some reason every value of my area under the ROC curve is coming out at 1 (as you can see from the graph below). The only thing I have changed is the recipe, to suit my data:

recipe(outcome ~ ., data = df_train) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>%  # dummy-code the 59 factor predictors
  step_zv(all_predictors()) %>%                   # drop zero-variance columns
  step_normalize(all_predictors()) %>%            # centre and scale all predictors
  step_medianimpute(all_predictors())             # impute remaining NAs with the training median

So I don't know whether my recipe is incorrect or my data is unsuitable for some reason. As mentioned, I have a binary outcome and 68 predictors (59 factors and 9 numeric); some do have missing data, but I thought step_medianimpute would deal with that. Many thanks for any help anyone can offer.
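To check that the imputation is actually doing its job, I can prep and bake the recipe and count any remaining NAs (a quick sketch, with the recipe above saved as rec; that name is just for illustration):

rec %>%
  prep(training = df_train) %>%   # estimate each step on the training data
  bake(new_data = NULL) %>%       # apply the steps and return the processed training set
  summarise(across(everything(), ~ sum(is.na(.x))))  # NAs left per column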

[Figure: my ROC AUC plot, with roc_auc equal to 1 at every penalty value]

Without seeing the data it is hard to know for sure, but your results indicate a couple of things.

Firstly, the ROC AUC of 1. A ROC AUC of 1 for a binary classification model indicates that the model can separate the two classes perfectly. This could either be a case of overfitting or a sign that your classes are linearly separable.
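To make that concrete, here is a toy sketch (not your data): when the predicted probability for one class is always higher than for the other, yardstick's roc_auc() comes out at exactly 1.

library(tidymodels)
set.seed(1)

# Ten perfectly separated predictions: every "yes" scores above every "no"
scores <- tibble(
  truth     = factor(rep(c("yes", "no"), each = 5), levels = c("yes", "no")),
  .pred_yes = c(runif(5, 0.8, 1.0), runif(5, 0.0, 0.2))
)

# .estimate will be exactly 1 because the two classes never overlap
roc_auc(scores, truth, .pred_yes)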

Secondly, the constant metric value across different values of penalty. For a LASSO model, as the penalty increases, more and more coefficients are shrunk to zero. In your case, for all the values of the penalty (if you are following the post, 10^(-4) through 10^(-1)), you are seeing the same performance. That means that even with a penalty of 10^(-1) you still haven't shrunk enough predictors to hurt or change the performance. Reprex below:

set.seed(1234)
library(tidymodels)

# Two classes, labelled 0 and 10, in a 50/50 split
response <- rep(c(0, 10), length.out = 1000)

# 50 numeric predictors, each centred on the class label,
# so the two classes are trivially separable
data <- bind_cols(
  response = factor(response),
  map_dfc(seq_len(50), ~ rnorm(1000, response))
)

data_split <- initial_split(data)

data_train <- training(data_split)
data_test <- testing(data_split)

# mixture = 1 gives a pure LASSO; the penalty is left to be tuned
lasso_spec <- logistic_reg(mixture = 1, penalty = tune()) %>%
  set_engine("glmnet")

lasso_wf <- workflow() %>%
  add_model(lasso_spec) %>%
  add_formula(response ~ .)

data_folds <- vfold_cv(data_train)

# Same penalty range as the case study: 10^(-4) through 10^(-1)
param_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))

tune_res <- tune_grid(
  lasso_wf, 
  resamples = data_folds, 
  grid = param_grid
)

autoplot(tune_res)
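You can also read the same results off as numbers rather than a plot; collect_metrics() returns the mean resampled estimate per penalty, and here the roc_auc column is flat across the whole grid:

# One row per penalty value; the mean roc_auc barely moves across the grid
collect_metrics(tune_res) %>%
  filter(.metric == "roc_auc") %>%
  arrange(penalty)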

What you can do is expand the range of penalties until the performance changes. Below we see that once the penalty is high enough, the last important predictors get shrunk to zero and we lose performance.

# Wider range of higher penalties: 10^(-1) through 10^0
param_grid <- tibble(penalty = 10^seq(-1, 0, length.out = 30))

tune_res <- tune_grid(
  lasso_wf, 
  resamples = data_folds, 
  grid = param_grid
)

autoplot(tune_res)

To verify, we fit the model using one of the well-performing penalties, and we get perfect predictions on the training data.

# Select the best penalty by ROC AUC and finalize the workflow
lasso_final <- finalize_workflow(lasso_wf, select_best(tune_res, metric = "roc_auc"))

lasso_final_fit <- fit(lasso_final, data = data_train)

# Confusion matrix on the training data: no misclassifications
augment(lasso_final_fit, new_data = data_train) %>%
  conf_mat(truth = response, estimate = .pred_class)
#>           Truth
#> Prediction   0  10
#>         0  375   0
#>         10   0 375
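As a final rough check (an addition to the reprex above), you can pull the fitted glmnet coefficients out of the workflow to see which terms the penalty has already shrunk to zero; extract_fit_parsnip() and tidy() are the standard tidymodels extractors:

# Coefficients at the selected penalty; terms with estimate == 0
# have been dropped from the model by the LASSO penalty
lasso_final_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(estimate == 0)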

Created on 2021-05-08 by the reprex package (v2.0.0)
