简体   繁体   中英

Error occurring in caret when running on a cluster

I am running the train function in caret on a cluster via doRedis . For the most part, it works, but every so often I get errors at the very end of this nature:

error calling combine function:
<simpleError: obj$state$numResults <= obj$state$numValues is not TRUE>

and

Error in names(resamples) <- gsub("^\\.", "", names(resamples)) : 
  attempt to set an attribute on NULL

when I run traceback() I get:

5: nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method, 
       ppOpts = preProcess, ctrl = trControl, lev = classLevels, 
       ...)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(couple ~ ., training.balanced, method = "nnet", 
       preProcess = "range", tuneGrid = nnetGrid, MaxNWts = 2200)
1: caret::train(couple ~ ., training.balanced, method = "nnet", 
       preProcess = "range", tuneGrid = nnetGrid, MaxNWts = 2200)

These errors are not easily reproducible (ie they happen sometimes, but not consistently) and only occur at the end of the run. The stdout on the cluster shows all tasks running and completed, so I am a bit flummoxed.

Has anyone encountered these errors? and if so understand the cause and even better a fix?

I imagine you've already solved this problem, but I ran into the same issue on my cluster consisting of linux and windows systems. I was running the server on ubuntu 14.04 and had noticed the warnings when starting the server service about having 'transparent huge pages' enabled in the linux kernel. I ignored that message and began running training exercises where most of the machines were maxed out with workers. I received the same error at the end of the run:

error calling combine function:
<simpleError: obj$state$numResults <= obj$state$numValues is not TRUE>

After a lot of head scratching and useless tinkering, I decided to address that warning by following the instructions here: http://ubuntuforums.org/showthread.php?t=2255151

Essentially, I installed hugeadm using:

sudo apt-get install hugeadm

Then disabled the transparent pages using:

hugeadm --thp-never

Note that this change will be undone on restart of the computer.

When I re-ran my training process it ran without any errors.

Hope that helps.

Cheers, Eric

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM