logistic regression with caret and glmnet in R

I'm trying to fit a logistic regression model to my data, using glmnet (for lasso) and caret (for k-fold cross-validation). I've tried two different syntaxes, but they both throw an error:

fitControl <- trainControl(method = "repeatedcv",
                       number = 10,
                       repeats = 3,
                       verboseIter = TRUE)

# with response as an integer (0/1)
fit_logistic <- train(response ~.,
                   data = df_without,
                   method = "glmnet",
                   trControl = fitControl,
                   family = "binomial")

Error in cut.default(y, breaks, include.lowest = TRUE) : 
 invalid number of intervals

df_without$response <- as.factor(df_without$response)
# with response as a factor
fit_logistic <- train(as.matrix(df_without[1:47]), df_without$response,
              method = "glmnet",
              trControl = fitControl,
              family = "binomial")

Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  : 
  NA/NaN/Inf in foreign function call (arg 5)
In addition: Warning message:
In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  NAs introduced by coercion

Do I need to convert my dataframe to a matrix or not?

Does my response variable need to be a factor or just 0/1 integers?

The .Rdata file with the df_without data frame is here.

sessionInfo()

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.1 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils         datasets  methods   base     

other attached packages:
 [1] e1071_1.6-4     plyr_1.8.2      gbm_2.1.1       survival_2.38-1     glmnet_2.0-2    foreach_1.4.2  
 [7] Matrix_1.2-0    caret_6.0-47    ggplot2_1.0.1   lattice_0.20-31     lubridate_1.3.3 RJDBC_0.2-5    
[13] rJava_0.9-6     DBI_0.3.1      

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6         compiler_3.2.0      nloptr_1.0.4            class_7.3-12        iterators_1.0.7    
 [6] tools_3.2.0         digest_0.6.8        lme4_1.1-7              memoise_0.2.1       nlme_3.1-120       
[11] gtable_0.1.2        mgcv_1.8-6          brglm_0.5-9             SparseM_1.6         proto_0.3-10       
[16] BradleyTerry2_1.0-6 stringr_1.0.0       gtools_3.5.0            grid_3.2.0          nnet_7.3-9         
[21] minqa_1.2.4         reshape2_1.4.1      car_2.0-25              magrittr_1.5        scales_0.2.4       
[26] codetools_0.2-11    MASS_7.3-40         pbkrtest_0.4-2          colorspace_1.2-6    quantreg_5.11      
[31] stringi_0.4-1       munsell_0.4.2  

I had the same problem and fixed it by using the model.matrix function to handle the coding of categorical variables.

Try this for the x argument in glmnet:

as.matrix(model.matrix(response ~ ., data = df_without)[, -1])

I removed the intercept column because the default in glmnet is to include an intercept.
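
To make the model.matrix approach concrete, here is a minimal sketch on toy data. The data frame `df_demo` and its columns are hypothetical stand-ins for `df_without`, which is not shown here. The response is converted to a factor so that caret treats the task as classification; the "no"/"yes" level names are my own choice (as far as I know caret only insists on valid R names when `classProbs = TRUE`). The caret call is guarded so the snippet still runs if caret or glmnet is not installed.

```r
# Toy data standing in for df_without (hypothetical; the real data is not shown).
set.seed(1)
n <- 200
df_demo <- data.frame(
  age      = rnorm(n, mean = 40, sd = 10),                          # continuous predictor
  group    = factor(sample(c("a", "b", "c"), n, replace = TRUE)),   # categorical predictor
  response = rbinom(n, size = 1, prob = 0.4)                        # 0/1 integer outcome
)

# model.matrix() expands factors into dummy columns and returns a numeric matrix.
# Column 1 is the intercept, which glmnet adds itself, so it is dropped.
x <- model.matrix(response ~ ., data = df_demo)[, -1]

# A factor outcome makes caret treat this as a classification problem.
y <- factor(df_demo$response, levels = c(0, 1), labels = c("no", "yes"))

# Fit only if both packages are available.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("glmnet", quietly = TRUE)) {
  ctrl <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 3)
  fit  <- caret::train(x = x, y = y,
                       method = "glmnet",
                       trControl = ctrl,
                       family = "binomial")
  print(fit$bestTune)   # selected alpha and lambda
}
```

With the x/y interface the predictors are passed as an already-numeric matrix, so no coercion of character or factor columns is left to happen inside glmnet.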

The problem is that you have continuous variables in your dataset. glmnet needs factor or binary variables.

If you run your first lines of code and select a few non-continuous variables you will see that it runs as expected.
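
When picking out a subset of variables, it may help to check each column's type first. The sketch below uses a hypothetical data frame `df_mixed` (not the poster's data) to list the non-numeric columns and to show what as.matrix() does with mixed column types. Whether this is what triggers the "NAs introduced by coercion" warning in the question is an assumption, since the actual data is not shown.

```r
# Hypothetical mixed-type data frame, used only for illustration.
df_mixed <- data.frame(
  income = c(10.5, 20.1, 15.3),
  city   = c("x", "y", "x"),
  stringsAsFactors = FALSE
)

# Names of the columns that are not numeric.
non_numeric <- names(df_mixed)[!sapply(df_mixed, is.numeric)]
print(non_numeric)

# as.matrix() on a data frame with any non-numeric column
# returns a character matrix rather than a numeric one.
m <- as.matrix(df_mixed)
print(typeof(m))
```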
