
Errors when running the caret package in R

I am attempting to build a model to predict whether a product will sell on an e-commerce website, with 1 or 0 as the output.

My data consists of a handful of categorical variables (one with a large number of levels), a couple of binary variables, and one continuous variable (the price), with an output variable of 1 or 0 indicating whether or not the product listing sold.

This is my code:

inTrainingset<-createDataPartition(C$Sale, p=.75, list=FALSE)
CTrain<-C[inTrainingset,]
CTest<-C[-inTrainingset,]


gbmfit<-gbm(Sale~., data=C, distribution="bernoulli", n.trees=5,
            interaction.depth=7, shrinkage=.01)
plot(gbmfit)


gbmTune<-train(Sale~.,data=CTrain, method="gbm")


ctrl<-trainControl(method="repeatedcv",repeats=5)
gbmTune<-train(Sale~.,data=CTrain, 
           method="gbm", 
           verbose=FALSE, 
           trControl=ctrl)


ctrl<-trainControl(method="repeatedcv", repeats=5, classProbs=TRUE, summaryFunction=twoClassSummary)
gbmTune<-trainControl(Sale~., data=CTrain, 
                  method="gbm", 
                  metric="ROC", 
                  verbose=FALSE , 
                  trControl=ctrl)



grid<-expand.grid(.interaction.depth=seq(1,7, by=2), .n.trees=seq(100,300, by=50), .shrinkage=c(.01,.1))

  gbmTune<-train(Sale~., data=CTrain, 
           method="gbm", 
           metric="ROC", 
           tuneGrid=grid, 
           verbose=FALSE,
           trControl=ctrl)



  set.seed(1)
  gbmTune <- train(Sale~., data = CTrain,
               method = "gbm",
               metric = "ROC",
               tuneGrid = grid,
               verbose = FALSE,
               trControl = ctrl)

I am running into two issues. The first is that when I attempt to add summaryFunction=twoClassSummary and then tune, I get this:

Error in trainControl(Sale ~ ., data = CTrain, method = "gbm", metric = "ROC",  : 
  unused arguments (data = CTrain, metric = "ROC", trControl = ctrl)

The second problem, if I bypass the summaryFunction, is that when I try to run the model I get this error:

Error in evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,  : 
  train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
In addition: Warning message:
In train.default(x, y, weights = w, ...) :
  cannnot compute class probabilities for regression

I tried changing the output variable from a numeric value of 1 or 0 to a text value in Excel, but that didn't make a difference.

Any help would be greatly appreciated on how to fix the fact that the model is being interpreted as regression, or on the first error message I am encountering.

Best,

Will will@nubimetrics.com

Your outcome is:

Sale = c(1L, 0L, 1L, 1L, 0L)

Although gbm expects it this way, it is a pretty unnatural way to encode the data. Almost every other function uses factors.

So if you give train numeric 0/1 data, it thinks that you want to do regression. If you convert this to a factor and use "0" and "1" as the levels (and you want class probabilities), you should have seen a warning that says "At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to...". That is not an idle warning.

Use factor levels that are valid R variable names and you should be fine.
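A minimal sketch of that conversion, using a hypothetical toy data frame that mirrors the question's C / Sale setup (the "NotSold"/"Sold" labels are my choice; any valid R variable names work):

```r
# Toy stand-in for the question's data frame C.
C <- data.frame(Sale  = c(1L, 0L, 1L, 1L, 0L),
                Price = c(10, 25, 7, 40, 12))

# Convert the numeric 0/1 outcome to a factor whose levels are valid
# R variable names; caret then treats the task as classification and
# can compute class probabilities for metric = "ROC".
C$Sale <- factor(C$Sale, levels = c(0, 1), labels = c("NotSold", "Sold"))

str(C$Sale)
# make.names() leaves valid level names unchanged -- a quick sanity check.
make.names(levels(C$Sale))
```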

Max

I was able to reproduce your error using the GermanCredit dataset (data(GermanCredit)).

Your error comes from using trainControl as if it were gbm, train, or something similar.

If you check the documentation with ?trainControl, you will see that it expects input quite different from what you're giving it.

This works:

require(caret)
require(gbm)
data(GermanCredit)

# Your dependent variable was Sale and it was binary
#   in place of Sale I will use the binary variable Telephone 

C      <- GermanCredit
C$Sale <- GermanCredit$Telephone

inTrainingset<-createDataPartition(C$Sale, p=.75, list=FALSE)
CTrain<-C[inTrainingset,]
CTest<-C[-inTrainingset,]
set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, 22)
seeds[[51]] <- sample.int(1000, 1)  # seed for the final model fit

gbmfit<-gbm(Sale~Age+ResidenceDuration, data=C,
            distribution="bernoulli", n.trees=5, interaction.depth=7, shrinkage=.01)
plot(gbmfit)


gbmTune<-train(Sale~Age+ResidenceDuration,data=CTrain, method="gbm")


ctrl<-trainControl(method="repeatedcv",repeats=5)
gbmTune<-train(Sale~Age+ResidenceDuration,data=CTrain, 
               method="gbm", 
               verbose=FALSE, 
               trControl=ctrl)


ctrl<-trainControl(method="repeatedcv", repeats=5, classProbs=TRUE, summaryFunction=twoClassSummary)

# gbmTune<-trainControl(Sale~Age+ResidenceDuration, data=CTrain, 
#                       method="gbm", 
#                       metric="ROC", 
#                       verbose=FALSE , 
#                       trControl=ctrl)

gbmTune <- trainControl(method = "adaptive_cv", 
                      repeats = 5,
                      verboseIter = TRUE,
                      seeds = seeds)

grid<-expand.grid(.interaction.depth=seq(1,7, by=2), .n.trees=seq(100,300, by=50), .shrinkage=c(.01,.1))

gbmTune<-train(Sale~Age+ResidenceDuration, data=CTrain, 
               method="gbm", 
               metric="ROC", 
               tuneGrid=grid, 
               verbose=FALSE,
               trControl=ctrl)



set.seed(1)
gbmTune <- train(Sale~Age+ResidenceDuration, data = CTrain,
                 method = "gbm",
                 metric = "ROC",
                 tuneGrid = grid,
                 verbose = FALSE,
                 trControl = ctrl)

Depending on what you're trying to accomplish, you may want to specify this a little differently, but it all boils down to the fact that you used trainControl as if it were train.
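To summarize the division of labour (a sketch reusing the question's Sale, CTrain, and grid names):

```r
library(caret)

# trainControl() only configures HOW resampling and summaries are done.
# It never sees the formula or the data.
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# train() is what takes the formula, the data, the tuning grid, and the
# control object -- those arguments simply don't exist in trainControl().
gbmTune <- train(Sale ~ ., data = CTrain,
                 method = "gbm", metric = "ROC",
                 tuneGrid = grid, verbose = FALSE,
                 trControl = ctrl)
```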
