Error occurring in caret when running on a cluster

I am running the train function in caret on a cluster via doRedis. For the most part it works, but every so often I get errors like these at the very end of the run:

error calling combine function:
<simpleError: obj$state$numResults <= obj$state$numValues is not TRUE>

and

Error in names(resamples) <- gsub("^\\.", "", names(resamples)) : 
  attempt to set an attribute on NULL

When I run traceback() I get:

5: nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method, 
       ppOpts = preProcess, ctrl = trControl, lev = classLevels, 
       ...)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(couple ~ ., training.balanced, method = "nnet", 
       preProcess = "range", tuneGrid = nnetGrid, MaxNWts = 2200)
1: caret::train(couple ~ ., training.balanced, method = "nnet", 
       preProcess = "range", tuneGrid = nnetGrid, MaxNWts = 2200)

These errors are not easily reproducible (i.e., they happen sometimes, but not consistently) and only occur at the end of the run. The stdout on the cluster shows every task running and completing, so I am a bit flummoxed.

Has anyone encountered these errors and, if so, do you understand the cause or, even better, know of a fix?

I imagine you've already solved this problem, but I ran into the same issue on my cluster, which consists of Linux and Windows systems. I was running the server on Ubuntu 14.04 and had noticed warnings when starting the server service about 'transparent huge pages' being enabled in the Linux kernel. I ignored that message and began running training exercises in which most of the machines were maxed out with workers. I received the same error at the end of the run:

error calling combine function:
<simpleError: obj$state$numResults <= obj$state$numValues is not TRUE>

After a lot of head scratching and useless tinkering, I decided to address that warning by following the instructions here: http://ubuntuforums.org/showthread.php?t=2255151

Essentially, I installed hugeadm using:

sudo apt-get install hugeadm

Then I disabled transparent huge pages using:

hugeadm --thp-never

Note that this change will be undone when the computer restarts.
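
Because the setting does not survive a reboot, it may be worth checking the current mode before starting a long run. The following is only a sketch: it assumes the kernel exposes the setting at /sys/kernel/mm/transparent_hugepage/enabled (the usual location on Linux, but not something I verified on every distribution), and thp_active_mode is a helper name made up for this example.

```shell
#!/bin/sh
# Assumed location of the kernel's THP setting; override THP_FILE if yours differs.
THP_FILE="${THP_FILE:-/sys/kernel/mm/transparent_hugepage/enabled}"

# Hypothetical helper: print the active mode, i.e. the word inside [brackets],
# from a line such as "always madvise [never]".
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

if [ -r "$THP_FILE" ]; then
    mode=$(thp_active_mode "$(cat "$THP_FILE")")
    echo "transparent huge pages: $mode"
    # Anything other than "never" means the setting was reset or never applied.
    [ "$mode" = "never" ] || echo "THP is not disabled; re-run: hugeadm --thp-never"
else
    echo "no THP setting found at $THP_FILE"
fi
```

If the reported mode is not "never" after a restart, the hugeadm command above needs to be run again before the next training job.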

When I re-ran my training process, it ran without any errors.

Hope that helps.

Cheers, Eric
