
Why is h2o.saveModel hanging in R v3.3.2 and H2O v3.10.4.2

23Jun2017: Yet another update...
11Apr2017: I added another update below...
I added an update below...

We have developed a model using gradient boosting machine (GBM). This model was originally developed using H2O v3.6.0.8 via R v3.2.3 on a Linux machine:

$ uname -a
Linux xrdcldapprra01.unix.medcity.net 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

The following code has been working fine for months:

modelname <- 'gbm_34325f.hex'
h2o.gbm(x = predictors, y = "outcome", training_frame = modified.hex,
    validation_frame = modified_holdout.hex, distribution="bernoulli",
    ntrees = 6000, learn_rate = 0.01, max_depth = 5,
    min_rows = 40, model_id = modelname)
gbm <- h2o.getModel(modelname)
h2o.saveModel( gbm, path='.', force = TRUE )

Last week we upgraded the Linux machine to:

  • R: v 3.3.2
  • H2O: v 3.10.4.2

As shown here in the output from h2o.init():

> h2o.init()
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 days 1 hours 
    H2O cluster version:        3.10.4.2 
    H2O cluster version age:    14 days, 22 hours and 48 minutes  
    H2O cluster name:           H2O_started_from_R_bac_ytl642 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   18.18 GB 
    H2O cluster total cores:    64 
    H2O cluster allowed cores:  64 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 3.3.2 (2016-10-31) 

I am now rebuilding this model from scratch in the newer versions of R and H2O. When I run the above R/H2O code, it hangs on this command:

h2o.saveModel( gbm, path='.', force = TRUE )

While my program is hung at h2o.saveModel, I started another R/H2O session and connected to the currently hung process. I can successfully get the model. I can successfully run h2o.saveModelDetails and save it as JSON. And I can save it as a MOJO. However, I cannot save it as a native 'hex' model via h2o.saveModel.

These are my commands and output from my connected session (while the original session remains hung):

> h2o.init()
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 days 1 hours 
    H2O cluster version:        3.10.4.2 
    H2O cluster version age:    14 days, 22 hours and 48 minutes  
    H2O cluster name:           H2O_started_from_R_bac_ytl642 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   18.18 GB 
    H2O cluster total cores:    64 
    H2O cluster allowed cores:  64 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 3.3.2 (2016-10-31) 

> modelname <- 'gbm_34325f.hex'
> gbm <- h2o.getModel(modelname)
> gbm
Model Details:
==============

H2OBinomialModel: gbm
Model ID:  gbm_34325f.hex 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1            6000                     6000           839613730         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000          6         32    17.51517
[ snip ]

> model_path <- h2o.saveModelDetails( object=gbm, path='.', force=TRUE )
> model_path
[1] "/home/bac/gbm_34325f.hex.json"

# file created:
# -rw-rw-r-- 1 bac bac      552K Apr  2 12:20 gbm_34325f.hex.json
#
# first few characters are:
# {"__meta":{"schema_version":3,"schema_name":"GBMModelV3","schema_type":"GBMModel"},

> h2o.saveMojo( gbm, path='.', force=TRUE )
[1] "/home/bac/gbm_34325f.hex.zip"

# file created:
# -rw-rw-r-- 1 bac bac   7120899 Apr  2 11:57 gbm_34325f.hex.zip
#
# when I unzip this file, things look okay (although MOJOs are new to me).

> h2o.saveModel( gbm, path='.', force=TRUE )
[ this hangs and never returns; I have to kill the entire R session ]

# empty file created:
# -rw-rw-r-- 1 bac bac         0 Apr  2 12:00 gbm_34325f.hex

I then accessed this hung process via the web interface, H2OFlow. Again, I can load and view the model. When I try to export the model, an empty .hex file is created and I see the message:

Waiting for 2 responses...

('2 responses' because I exported twice.)

  1. Snapshot of Export via H2OFlow
  2. Snapshot of 'Waiting for 2 responses' message from exportModel

To be clear, I am not loading an old model. Rather, I am rebuilding the model from scratch in the new R/H2O environment. I am, however, using the same R/H2O code that was successful in the older environment.

Any ideas of what is going on? Thanks.


UPDATE:

The problem I have -- h2o.saveModel hangs -- is related to OOM (out of memory).

I see these messages in the .out file created when I run h2o.init:

Note:  In case of errors look at the following log files:
    /tmp/RtmpOnJn83/h2o_bfo7328_started_from_r.out
    /tmp/RtmpOnJn83/h2o_bfo7328_started_from_r.err

$ tail -n 6 h2o_bfo7328_started_from_r.out
[ I removed the timestamp / IP info to help make this readable ]

FJ-1-107  INFO:  2017-04-04 01:27:04 30 min 56.196 sec            6000       0.25485          0.22119      0.96950       3.54582                       0.08634
2946-780 INFO: GET /3/Models/gbm_34325f.hex, parms: {}
2946-780 INFO: GET /3/Models/gbm_34325f.hex, parms: {}
946-1102 INFO: GET /99/Models.bin/gbm_34325f.hex, parms: {dir=/opt/app/STUFF/bpci/training/facility_models/gbm_34325f.hex, force=TRUE}
946-1102 WARN: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:3.15 GB + POJO:Zero   + FREE:441.54 GB == MEM_MAX:444.44 GB), desiredKV=299.74 GB OOM!
946-1102 WARN: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:3.15 GB + POJO:Zero   + FREE:441.54 GB == MEM_MAX:444.44 GB), desiredKV=299.74 GB OOM!

Once I realized this was an OOM issue, I changed my h2o.init to include max_mem_size:

localH2O = h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = '500G')

Even with max_mem_size = '500G' set this high, I still get an OOM error (see above).

When I was running H2O v3.6.0.8, I didn't explicitly define max_mem_size.
I am curious: Now that I've upgraded to H2O v3.10.4.2, is there a larger memory demand? What was the default max_mem_size in H2O v3.6.0.8?

Any idea of what changed memory-wise between the two versions of H2O? And how can I get this to run again?
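On the question of the default: when max_mem_size is omitted, the launcher leaves the heap size to the JVM's own default, so one way to check what a given version effectively gets is to start a cluster without it and read back the reported memory. A sketch, assuming the h2o R package's h2o.clusterInfo() helper:

```r
# Sketch (not from the original post): start a cluster with no explicit
# max_mem_size and read back the heap the JVM actually received.
library(h2o)
h2o.init(nthreads = -1)   # no max_mem_size: the JVM picks a default heap
h2o.clusterInfo()         # look for the 'H2O cluster total memory' line
```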

Thanks!


11Apr2017 UPDATE:

I hoped to share the dataset that generates this error. Unfortunately, the data contains protected information, so I cannot share it. I created a 'scrubbed' version of this file -- it contains nonsense data -- but I found it much too difficult to run this scrubbed data through our model-training R code because of various dependencies and validation checks.

I have a general sense of what sorts of parameters cause the OOM (out of memory) error during h2o.saveModel.
Causes errors:

  • 51380 records with 1413 columns of data used to train
  • ntrees = 6000

Does not cause errors:

  • 51380 records with 1413 columns of data used to train
  • ntrees = 3750 (but ntrees = 4000 causes an error)

Does not cause errors:

  • 25000 records with 1413 columns of data used to train (but 40000 records causes an error)
  • ntrees = 6000

There is some combination of number of records, number of columns, and ntrees that eventually causes OOM.

Setting max_mem_size does not help at all. I set it to '100G', '200G', and '300G' and still got OOM during h2o.saveModel.

Testing earlier versions of H2O

Because I cannot compromise on the number of records and columns used to train, or on the number of trees needed in the GBM, I had to go back to an earlier version of H2O.

After working with ten different versions of H2O, I found the most recent released version that does not produce OOM. The versions and the results are:

  1. v3.6.0.8 - success (original version used to create model)
  2. v3.8.1.4 - success
  3. v3.10.0.8 - success
  4. v3.10.2.1 - success
  5. v3.10.3.1 - error: OOM
  6. v3.10.3.2 - error: OOM
  7. v3.10.3.5 - error: OOM
  8. v3.10.4.2 - error: OOM (upgraded to this; found OOM error)
  9. v3.10.4.3 - error: OOM
  10. v3.11.0.3839 - success

I am not using v3.11.0.3839 since it seems to be 'bleeding edge'. I am currently running with v3.10.2.1.

I hope this helps someone track down this bug.


23Jun2017 UPDATE:

I was able to fix this problem by:

  1. upgrading to v3.10.5.1
  2. setting both min_mem_size and max_mem_size during h2o.init()

See: https://stackoverflow.com/a/44724813/7733787
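Combining the two steps above, the working h2o.init() call looks roughly like this. The '100G' values are illustrative placeholders, not figures from the post; in the h2o R package, min_mem_size maps to the JVM -Xms flag and max_mem_size to -Xmx:

```r
# Sketch of the fix: pin the JVM heap by setting both bounds.
# The '100G' sizes here are illustrative placeholders only.
library(h2o)
h2o.init(ip = "localhost", port = 54321, nthreads = -1,
         min_mem_size = "100G",   # -Xms: pre-allocate the heap up front
         max_mem_size = "100G")   # -Xmx: cap the heap at the same value
```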

ANSWER:

As this problem is directly related to memory, let's set memory properly for your H2O instance and make sure the setting actually takes effect. Setting max_mem_size to arbitrary values (100g, 200g, 300g) is not going to help. First we need to know the total RAM in your machine, and then you can give about 80% of that memory to your H2O instance.
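On Linux, one rough way to follow that 80% guideline is to read the machine's total RAM from /proc/meminfo. This helper is my own sketch, not part of the answer:

```r
# Sketch (not from the answer): compute ~80% of physical RAM as a
# starting point for max_mem_size on a Linux box.
meminfo  <- readLines("/proc/meminfo")
total_kb <- as.numeric(gsub("\\D", "", grep("^MemTotal", meminfo, value = TRUE)))
suggested_gb <- floor(total_kb * 0.8 / 1024 / 1024)
sprintf("suggested max_mem_size = '%dg'", suggested_gb)
```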

For example, I have 16GB in my machine and want to give 12GB to the H2O instance when starting it from R, so I will do the following:

h2o.init(max_mem_size = "12g")

Once H2O is up and running, I will get confirmation of the memory set for the H2O process as below:

R is connected to the H2O cluster: 
H2O cluster uptime:         2 seconds 166 milliseconds 
H2O cluster version:        3.10.4.3 
H2O cluster version age:    12 days  
H2O cluster name:           H2O_started_from_R_avkashchauhan_kuc791 
H2O cluster total nodes:    1 
H2O cluster total memory:   10.67 GB <=== [memory setting working]
H2O cluster total cores:    8 
H2O cluster allowed cores:  2 
H2O cluster healthy:        TRUE 
H2O Connection ip:          localhost 
H2O Connection port:        54321 
H2O Connection proxy:       NA 
H2O Internal Security:      FALSE 
R Version:                  R version 3.3.2 (2016-10-31) 

If you change your dataset size across the various model-building steps, you will see OOM at seemingly random row counts, because sometimes the Java GC has already cleared unused memory and sometimes it is still waiting to do so. You may hit OOM once with N rows yet not hit it with 2N rows in the same Java instance, so chasing that route is not useful.

It is definitely a memory-related issue, so make sure you give the H2O cluster enough memory and then see how it works.
