Why is h2o.saveModel hanging in R v3.3.2 and H2O v3.10.4.2
23Jun2017: Yet another update...
11Apr2017: I added another update below...
I added an update below...
We have developed a model using a gradient boosting machine (GBM). This model was originally developed using H2O v3.6.0.8 via R v3.2.3 on a Linux machine:
$ uname -a
Linux xrdcldapprra01.unix.medcity.net 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
The following code has been working fine for months:
modelname <- 'gbm_34325f.hex'
h2o.gbm(x = predictors, y = "outcome", training_frame = modified.hex,
        validation_frame = modified_holdout.hex, distribution = "bernoulli",
        ntrees = 6000, learn_rate = 0.01, max_depth = 5,
        min_rows = 40, model_id = modelname)
gbm <- h2o.getModel(modelname)
h2o.saveModel(gbm, path = '.', force = TRUE)
Last week we upgraded the Linux machine. As shown here in the output from h2o.init():
> h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 2 days 1 hours
H2O cluster version: 3.10.4.2
H2O cluster version age: 14 days, 22 hours and 48 minutes
H2O cluster name: H2O_started_from_R_bac_ytl642
H2O cluster total nodes: 1
H2O cluster total memory: 18.18 GB
H2O cluster total cores: 64
H2O cluster allowed cores: 64
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 3.3.2 (2016-10-31)
I am now rebuilding this model from scratch in the newer versions of R and H2O. When I run the above R/H2O code, it hangs on this command:
h2o.saveModel( gbm, path='.', force = TRUE )
While my program was hung at h2o.saveModel, I started another R/H2O session and connected to the currently hung process. I can successfully get the model. I can successfully run h2o.saveModelDetails and save it as JSON. And I can save it as MOJO. However, I cannot save it as a native 'hex' model via h2o.saveModel.
These are my commands and output from my connected session (while the original session remains hung):
> h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 2 days 1 hours
H2O cluster version: 3.10.4.2
H2O cluster version age: 14 days, 22 hours and 48 minutes
H2O cluster name: H2O_started_from_R_bac_ytl642
H2O cluster total nodes: 1
H2O cluster total memory: 18.18 GB
H2O cluster total cores: 64
H2O cluster allowed cores: 64
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 3.3.2 (2016-10-31)
> modelname <- 'gbm_34325f.hex'
> gbm <- h2o.getModel(modelname)
> gbm
Model Details:
==============
H2OBinomialModel: gbm
Model ID: gbm_34325f.hex
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1 6000 6000 839613730 5
max_depth mean_depth min_leaves max_leaves mean_leaves
1 5 5.00000 6 32 17.51517
[ snip ]
> model_path <- h2o.saveModelDetails( object=gbm, path='.', force=TRUE )
> model_path
[1] "/home/bac/gbm_34325f.hex.json"
# file created:
# -rw-rw-r-- 1 bac bac 552K Apr 2 12:20 gbm_34325f.hex.json
#
# first few characters are:
# {"__meta":{"schema_version":3,"schema_name":"GBMModelV3","schema_type":"GBMModel"},
> h2o.saveMojo( gbm, path='.', force=TRUE )
[1] "/home/bac/gbm_34325f.hex.zip"
# file created:
# -rw-rw-r-- 1 bac bac 7120899 Apr 2 11:57 gbm_34325f.hex.zip
#
# when I unzip this file, things look okay (altho MOJOs are new to me).
> h2o.saveModel( gbm, path='.', force=TRUE )
[ this hangs and never returns; I have to kill the entire R session ]
# empty file created:
# -rw-rw-r-- 1 bac bac 0 Apr 2 12:00 gbm_34325f.hex
I then accessed this hung process via the web interface, H2O Flow. Again, I can load and view the model. When I try to export the model, an empty .hex file is created and I see the message:
Waiting for 2 responses...
(2 responses because I exported twice.)
To be clear, I am not loading an old model. Rather, I am rebuilding the model from scratch in the new R/H2O environment. I am, however, using the same R/H2O code that was successful in the older environment.
Any ideas about what is going on? Thanks.
The problem I have (h2o.saveModel hangs) is related to OOM (out of memory).
I see these messages in the .out file created when I call h2o.init:
Note: In case of errors look at the following log files:
/tmp/RtmpOnJn83/h2o_bfo7328_started_from_r.out
/tmp/RtmpOnJn83/h2o_bfo7328_started_from_r.err
$ tail -n 6 h2o_bfo7328_started_from_r.out
[ I removed the timestamp / IP info to help make this readable ]
FJ-1-107 INFO: 2017-04-04 01:27:04 30 min 56.196 sec 6000 0.25485 0.22119 0.96950 3.54582 0.08634
2946-780 INFO: GET /3/Models/gbm_34325f.hex, parms: {}
2946-780 INFO: GET /3/Models/gbm_34325f.hex, parms: {}
946-1102 INFO: GET /99/Models.bin/gbm_34325f.hex, parms: {dir=/opt/app/STUFF/bpci/training/facility_models/gbm_34325f.hex, force=TRUE}
946-1102 WARN: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:3.15 GB + POJO:Zero + FREE:441.54 GB == MEM_MAX:444.44 GB), desiredKV=299.74 GB OOM!
946-1102 WARN: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:3.15 GB + POJO:Zero + FREE:441.54 GB == MEM_MAX:444.44 GB), desiredKV=299.74 GB OOM!
Once I realized this was an OOM issue, I changed my h2o.init call to include max_mem_size:
localH2O = h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = '500G')
Even with max_mem_size = '500G' set this high, I still get an OOM error (see above).
When I was running H2O v3.6.0.8, I didn't explicitly define max_mem_size.
I am curious: Now that I've upgraded to H2O v3.10.4.2, is there a larger memory demand? What was the default max_mem_size in H2O v3.6.0.8?
Any idea of what changed memory-wise between the two versions of H2O? And how can I get this to run again?
Thanks!
I had hoped to share the dataset that generates this error. Unfortunately, the data contains protected information, so I cannot share it. I created a 'scrubbed' version of this file -- containing nonsense data -- but I found it much too difficult to run this scrubbed data through our model-training R code because of various dependencies and validation checks.
I have a general sense of what sorts of parameters cause the OOM (out of memory) error during h2o.saveModel.
Causes errors:
Does not cause errors:
There is some combination of number of records, number of columns, and ntrees that eventually causes OOM.
Setting max_mem_size does not help at all. I set it to '100G', '200G', and '300G' and still hit OOM during h2o.saveModel.
Because I cannot compromise on the number of records and columns used for training, or on the number of trees needed in the GBM, I had to go back to an earlier version of h2o.
After working with ten different versions of h2o, I found the most recent released version that does not produce OOM. The versions and the results are:
I am not using v3.11.0.3839 since it seems to be 'bleeding edge'. I am currently running v3.10.2.1.
I hope this helps someone track down this bug.
I was able to fix this problem by setting both min_mem_size and max_mem_size during h2o.init().
See: https://stackoverflow.com/a/44724813/7733787
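Concretely, the working call looked something like the sketch below. The sizes shown are illustrative placeholders, not the exact values I used; tune them to your machine's RAM.

```r
library(h2o)

# Setting both the floor (min_mem_size) and the ceiling (max_mem_size) of
# the JVM heap avoided the OOM during h2o.saveModel.
# The sizes below are placeholders -- adjust for your hardware.
localH2O <- h2o.init(ip = "localhost", port = 54321, nthreads = -1,
                     min_mem_size = "100G", max_mem_size = "200G")
```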
As this problem is directly related to memory, let's make sure you set memory properly for your h2o instance and confirm the setting is working. Setting max_mem_size to arbitrary numbers (100g, 200g, 300g) is not going to help. First we need to know the total RAM in your machine; then you can give about 80% of it to your h2o instance.
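On Linux you can check total RAM and compute the ~80% figure like this (a sketch; /proc/meminfo is Linux-specific):

```shell
# Total physical memory in kB, from /proc/meminfo (Linux-specific)
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# Convert to whole GB, then take ~80% for the H2O heap
total_gb=$(( total_kb / 1024 / 1024 ))
h2o_gb=$(( total_gb * 80 / 100 ))
echo "Total RAM: ${total_gb}G; pass max_mem_size = '${h2o_gb}g' to h2o.init()"
```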
For example, I have 16GB in my machine and I want to give 12GB to the H2O instance when it is started from R. I will do the following:
h2o.init(max_mem_size = "12g")
Once H2O is up and running, I will get confirmation of the memory set for the H2O process, as below:
R is connected to the H2O cluster:
H2O cluster uptime: 2 seconds 166 milliseconds
H2O cluster version: 3.10.4.3
H2O cluster version age: 12 days
H2O cluster name: H2O_started_from_R_avkashchauhan_kuc791
H2O cluster total nodes: 1
H2O cluster total memory: 10.67 GB <=== [memory setting working]
H2O cluster total cores: 8
H2O cluster allowed cores: 2
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 3.3.2 (2016-10-31)
If you change your dataset size during the various model-building steps, you will see OOM at seemingly random row counts, because sometimes the Java GC will already have cleared the unused memory and sometimes it is still waiting to clear it. So you may hit OOM once with N rows and yet not hit OOM with 2N rows in the same Java instance. Chasing that route is not useful.
It is definitely a memory-related issue, so make sure you give enough memory to the H2O cluster and then see how it works.