Error when calculating variable importance with categorical variables using the caret package (varImp)

I've been trying to compute the variable importance for a model with mixed scale features using the varImp function in the caret package. I've tried a number of approaches, including renaming and coding my levels numerically. In each case, I am getting the following error:

Error in auc3_(actual, predicted, ranks) : 
  Not compatible with requested type: [type=character; target=double].

The following dummy example should illustrate my point (edited to reflect @StupidWolf's correction):

library(caret)

#create small dummy dataset
set.seed(124)
dummy_data = data.frame(Label = factor(sample(c("a","b"),40, replace = TRUE)))
dummy_data$pred1 = ifelse(dummy_data$Label=="a",rnorm(40,-.5,2),rnorm(40,.5,2))
dummy_data$pred2 = factor(ifelse(dummy_data$Label=="a",rbinom(40,1,0.3),rbinom(40,1,0.7)))


# check varImp
control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)
model.lvq <- caret::train(Label~., data=dummy_data, 
                          method="lvq", preProcess="scale", trControl=control.lvq)
varImp.lvq <- caret::varImp(model.lvq, scale=FALSE)                       

The issue persists when using different models (like randomForest and SVM).

If anyone knows a solution or can tell me what is going wrong, I would highly appreciate that.

Thanks!

When you call varImp on an lvq model, it falls back to filterVarImp() because there is no model-specific variable importance method for lvq. The help page states:

For two class problems, a series of cutoffs is applied to the predictor data to predict the class. The sensitivity and specificity are computed for each cutoff and the ROC curve is computed.

If you read the source code of varImp.train(), the data it feeds into filterVarImp() is the original data frame, not the output of the preprocessing step.

This means that if a variable in the original data is a factor, it cannot be cut at numeric thresholds, and the call throws an error like this:

filterVarImp(data.frame(dummy_data$pred2),dummy_data$Label)
Error in auc3_(actual, predicted, ranks) : 
  Not compatible with requested type: [type=character; target=double].
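To make the cutoff idea concrete, here is a minimal base-R sketch of an AUC computed from ranks (the Mann-Whitney identity). This is not caret's implementation; auc_rank is a hypothetical helper written only for illustration. It shows why a numeric predictor works while a factor has to be coded numerically first.

```r
# Hypothetical helper (not part of caret): AUC of a numeric score for a
# two-class outcome, computed from ranks via the Mann-Whitney identity.
auc_rank <- function(score, label, positive = levels(label)[2]) {
  stopifnot(is.numeric(score), is.factor(label), nlevels(label) == 2)
  r <- rank(score)                        # average ranks handle ties
  n_pos <- sum(label == positive)
  n_neg <- length(label) - n_pos
  (sum(r[label == positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Same dummy data as in the question
set.seed(124)
label <- factor(sample(c("a", "b"), 40, replace = TRUE))
pred1 <- ifelse(label == "a", rnorm(40, -0.5, 2), rnorm(40, 0.5, 2))
pred2 <- factor(ifelse(label == "a", rbinom(40, 1, 0.3), rbinom(40, 1, 0.7)))

auc_rank(pred1, label)                            # numeric predictor: fine
# auc_rank(pred2, label)                          # factor: fails the is.numeric() check
auc_rank(as.numeric(as.character(pred2)), label)  # fine after numeric coding
```

A factor has no natural ordering of thresholds to sweep over, which is why it has to be turned into numbers before any ROC-based score can be computed.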

So, using my example and as you pointed out, you need to one-hot encode it:

set.seed(111)
# factor() keeps the outcome a class label regardless of the stringsAsFactors default
dummy_data = data.frame(Label = factor(rep(c("a","b"),each=20)))
dummy_data$pred1 = rnorm(40,rep(c(-0.5,0.5),each=20),2)
dummy_data$pred2 = rbinom(40,1,rep(c(0.3,0.7),each=20))
dummy_data$pred2 = factor(dummy_data$pred2)

control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)

ohe_data = data.frame(
            Label = dummy_data$Label,
            model.matrix(Label ~ 0+.,data=dummy_data))

model.lvq <- caret::train(Label~., data=ohe_data, 
                          method="lvq", preProcess="scale",
                          trControl=control.lvq)

caret::varImp(model.lvq, scale=FALSE)  

ROC curve variable importance

       Importance
pred1      0.6575
pred20     0.6000
pred21     0.6000

If you use a model that doesn't have a model-specific variable importance method, one option is to calculate the filter-based variable importance first and fit the model afterwards.
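As a rough sketch of that option, you can call filterVarImp() directly on the predictors before training anything. The assumption here is that the binary predictor is kept as numeric 0/1 rather than a factor, so it can be cut like any other numeric column; the names x and label are only for illustration.

```r
library(caret)

set.seed(111)
label <- factor(rep(c("a", "b"), each = 20))
x <- data.frame(
  pred1 = rnorm(40, rep(c(-0.5, 0.5), each = 20), 2),
  pred2 = rbinom(40, 1, rep(c(0.3, 0.7), each = 20))  # kept numeric 0/1, not a factor
)

# Filter-based (ROC) importance, computed without fitting any model
imp <- filterVarImp(x, label)
imp
```

Because pred2 stays a single numeric column, this gives one row per original predictor instead of one row per indicator column.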

Note that this problem can be circumvented by replacing each ordinal feature (with d levels) by its (d-1)-dimensional indicator encoding:

model.matrix(~dummy_data$pred2-1)[, 1:(length(levels(dummy_data$pred2))-1), drop=FALSE]

However, why does varImp not handle this automatically? Further, this has the drawback that it yields an importance score for each of the d-1 indicators, not one unified importance score for the original feature.
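For illustration, here is how the (d-1)-indicator encoding looks on a made-up three-level feature, next to one possible workaround for getting a single score: coding the ordered levels as integers. The integer coding is my own assumption about what might be acceptable for a truly ordinal feature; it is not something varImp does, and the feature f is hypothetical.

```r
# Hypothetical three-level feature, levels listed in their natural order
f <- factor(c("low", "mid", "high", "mid", "low", "high"),
            levels = c("low", "mid", "high"))
d <- nlevels(f)

# Full indicator matrix: one 0/1 column per level (d columns)
full <- model.matrix(~ f - 1)

# Keep only d-1 indicator columns; drop = FALSE keeps a matrix even when d = 2
reduced <- full[, seq_len(d - 1), drop = FALSE]
dim(reduced)

# Alternative for ordinal features: a single integer-coded column,
# which yields one importance score instead of d-1
ordinal_code <- as.integer(f)
ordinal_code
```

The integer coding treats the gaps between adjacent levels as equal in rank terms, so it only makes sense when the levels really are ordered.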
