H2O: Cannot read LARGE model from disk via `h2o.loadModel`

  • UPDATED 28Jun2017, below, in response to @Michal Kurka.
  • UPDATED 26Jun2017, below.

I am unable to load a large GBM model that I saved in native H2O format (i.e., hex).

  • H2O v3.10.5.1
  • R v3.3.2
  • Linux 3.10.0-327.el7.x86_64 GNU/Linux

My goal is to eventually save this model as a MOJO.

This model was so large that I had to initialize H2O with min/max memory of 100G/200G before model training would run successfully.

This is how I trained the GBM model:

# Start a local H2O cluster with a 100G/200G min/max heap.
localH2O <- h2o.init(ip = 'localhost', port = port, nthreads = -1,
                     min_mem_size = '100G', max_mem_size = '200G')

# Train a multinomial GBM; numCats is the number of response categories.
iret <- h2o.gbm(x = predictors, y = response, training_frame = train.hex,
                validation_frame = holdout.hex, distribution = "multinomial",
                ntrees = 3000, learn_rate = 0.01, max_depth = 5, nbins = numCats,
                model_id = basename_model)

# Retrieve the trained model and save it to disk in native (hex) format.
gbm <- h2o.getModel(basename_model)
oPath <- h2o.saveModel(gbm, path = './', force = TRUE)

The training data contains 81,886 records with 1413 columns. Of these columns, 19 are factors. The vast majority of these columns are 0/1.

$ wc -l training/*.txt
     81887 training/train.txt
     27294 training/holdout.txt

This is the saved model as written to disk:

$ ls -l
total 37G
-rw-rw-r-- 1 bfo7328 37G Jun 22 19:57 my_model.hex

This is how I tried to read the model from disk, using the same large memory allocation values of 100G/200G:

$ R

R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

> library(h2o)
> localH2O=h2o.init(ip='localhost', port=65432, nthreads=-1,
                  min_mem_size='100G', max_mem_size='200G')

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/RtmpVSwxXR/h2o_bfo7328_started_from_r.out
    /tmp/RtmpVSwxXR/h2o_bfo7328_started_from_r.err

openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)

Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         3 seconds 550 milliseconds 
    H2O cluster version:        3.10.5.1 
    H2O cluster version age:    13 days  
    H2O cluster name:           H2O_started_from_R_bfo7328_kmt050 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   177.78 GB 
    H2O cluster total cores:    64 
    H2O cluster allowed cores:  64 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        65432 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 3.3.2 (2016-10-31) 

From /tmp/RtmpVSwxXR/h2o_bfo7328_started_from_r.out:

INFO: Processed H2O arguments: [-name, H2O_started_from_R_bfo7328_kmt050, -ip, localhost, -port, 65432, -ice_root, /tmp/RtmpVSwxXR]
INFO: Java availableProcessors: 64
INFO: Java heap totalMemory: 95.83 GB
INFO: Java heap maxMemory: 177.78 GB
INFO: Java version: Java 1.8.0_121 (from Oracle Corporation)
INFO: JVM launch parameters: [-Xms100G, -Xmx200G, -ea]
INFO: OS version: Linux 3.10.0-327.el7.x86_64 (amd64)
INFO: Machine physical memory: 1.476 TB

My call to h2o.loadModel:

if ( TRUE ) {
  now <- format(Sys.time(), "%a %b %d %Y %X")
  cat( sprintf( 'Begin %s\n', now ))

  model_filename <- './my_model.hex'
  in_model.hex <- h2o.loadModel( model_filename )

  now <- format(Sys.time(), "%a %b %d %Y %X")
  cat( sprintf( 'End   %s\n', now ))
}

From /tmp/RtmpVSwxXR/h2o_bfo7328_started_from_r.out:

INFO: GET /, parms: {}
INFO: GET /, parms: {}
INFO: GET /, parms: {}
INFO: GET /3/InitID, parms: {}
INFO: Locking cloud to new members, because water.api.schemas3.InitIDV3
INFO: POST /99/Models.bin/, parms: {dir=./my_model.hex}

After waiting an hour, I see these "out of memory" (OOM) error messages:

INFO: POST /99/Models.bin/, parms: {dir=./my_model.hex}
#e Thread WARN: Swapping!  GC CALLBACK, (K/V:24.86 GB + POJO:112.01 GB + FREE:40.90 GB == MEM_MAX:177.78 GB), desiredKV=22.22 GB OOM!
#e Thread WARN: Swapping!  GC CALLBACK, (K/V:26.31 GB + POJO:118.41 GB + FREE:33.06 GB == MEM_MAX:177.78 GB), desiredKV=22.22 GB OOM!
#e Thread WARN: Swapping!  GC CALLBACK, (K/V:27.36 GB + POJO:123.03 GB + FREE:27.39 GB == MEM_MAX:177.78 GB), desiredKV=22.22 GB OOM!
#e Thread WARN: Swapping!  GC CALLBACK, (K/V:28.21 GB + POJO:126.73 GB + FREE:22.83 GB == MEM_MAX:177.78 GB), desiredKV=22.22 GB OOM!

I would not expect to need so much memory to read the model from disk.

How can I read this model from disk into memory? And once I do, can I save it as a MOJO?
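For reference, this is how I plan to export the MOJO once the model loads; a minimal sketch, assuming the in_model.hex object returned by h2o.loadModel above is valid (it uses h2o.download_mojo from the h2o R package; the path value is illustrative):

# Sketch: write the model as a MOJO zip, plus the h2o-genmodel.jar
# needed to score the MOJO outside of H2O; returns the MOJO file name.
mojo_name <- h2o.download_mojo(in_model.hex, path = './',
                               get_genmodel_jar = TRUE)
cat(sprintf('MOJO written: %s\n', mojo_name))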


UPDATE 1: 26Jun2017

I just noticed that the disk size of a GBM model increased dramatically between versions of H2O:

H2O v3.10.2.1: -rw-rw-r-- 1 169M Jun 19 07:23 my_model.hex
H2O v3.10.5.1: -rw-rw-r-- 1  37G Jun 22 19:57 my_model.hex

Any ideas why? Could this be the root of the problem?


UPDATE 2: 28Jun2017, in response to comments by @Michal Kurka.

When I load the training data via fread, the class (type) of each column is correct:

  • 24 columns are 'character';
  • 1389 columns are 'integer' (all but one column are 0/1);
  • 1413 total columns.

I then convert the R-native data frame to an H2O data frame and manually factor-ize 20 columns, roughly as sketched below.
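The conversion code itself isn't shown above; this is a minimal sketch of the approach (the column name my_factor_col is illustrative; the real code repeats the as.factor conversion for 20 specific columns):

library(data.table)
library(h2o)

# Read the raw training data; fread auto-detects column classes.
df.train <- fread('training/train.txt')

# Copy the R-native data frame into the H2O cluster as an H2OFrame.
train.hex <- as.h2o(df.train, destination_frame = 'train.hex')

# Convert a known categorical column to a factor (enum);
# repeated for each of the 20 categorical columns.
train.hex$my_factor_col <- as.factor(train.hex$my_factor_col)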

 - attr(*, "nrow")= int 81886
 - attr(*, "ncol")= int 1413
 - attr(*, "types")=List of 1413
  ..$ : chr "enum" : Factor w/ 72 levels
  ..$ : chr "enum" : Factor w/ 77 levels
  ..$ : chr "enum" : Factor w/ 51 levels
  ..$ : chr "enum" : Factor w/ 4226 levels
  ..$ : chr "enum" : Factor w/ 4183 levels
  ..$ : chr "enum" : Factor w/ 3854 levels
  ..$ : chr "enum" : Factor w/ 3194 levels
  ..$ : chr "enum" : Factor w/ 735 levels
  ..$ : chr "enum" : Factor w/ 133 levels
  ..$ : chr "enum" : Factor w/ 16 levels
  ..$ : chr "enum" : Factor w/ 25 levels
  ..$ : chr "enum" : Factor w/ 647 levels
  ..$ : chr "enum" : Factor w/ 715 levels
  ..$ : chr "enum" : Factor w/ 679 levels
  ..$ : chr "enum" : Factor w/ 477 levels
  ..$ : chr "enum" : Factor w/ 645 levels
  ..$ : chr "enum" : Factor w/ 719 levels
  ..$ : chr "enum" : Factor w/ 678 levels
  ..$ : chr "enum" : Factor w/ 478 levels

A condensed version of the output from str(train.hex), showing only the 19 columns that are factors (one of these factors is the response column):

  - attr(*, "nrow")= int 81886 - attr(*, "ncol")= int 1413 - attr(*, "types")=List of 1413 ..$ : chr "enum" : Factor w/ 72 levels ..$ : chr "enum" : Factor w/ 77 levels ..$ : chr "enum" : Factor w/ 51 levels ..$ : chr "enum" : Factor w/ 4226 levels ..$ : chr "enum" : Factor w/ 4183 levels ..$ : chr "enum" : Factor w/ 3854 levels ..$ : chr "enum" : Factor w/ 3194 levels ..$ : chr "enum" : Factor w/ 735 levels ..$ : chr "enum" : Factor w/ 133 levels ..$ : chr "enum" : Factor w/ 16 levels ..$ : chr "enum" : Factor w/ 25 levels ..$ : chr "enum" : Factor w/ 647 levels ..$ : chr "enum" : Factor w/ 715 levels ..$ : chr "enum" : Factor w/ 679 levels ..$ : chr "enum" : Factor w/ 477 levels ..$ : chr "enum" : Factor w/ 645 levels ..$ : chr "enum" : Factor w/ 719 levels ..$ : chr "enum" : Factor w/ 678 levels ..$ : chr "enum" : Factor w/ 478 levels 

The above results are exactly the same between v3.10.2.1 (smaller model written to disk: 169M) and v3.10.5.1 (larger model written to disk: 37G).

The actual GBM training uses nbins <- 37:

numCats <- n_distinct(as.matrix(dplyr::select_(df.train, response)))
numCats
[1] 37

iret <- h2o.gbm(x = predictors, y = response, training_frame = train.hex,
                validation_frame = holdout.hex, distribution = "multinomial",
                ntrees = 3000, learn_rate = 0.01, max_depth = 5, nbins = numCats,
                model_id = basename_model)

For reference, @Michal Kurka's comments: The difference in size of the models (169M vs 37G) is surprising. Can you please make sure that H2O recognizes all your numeric columns as numeric, and not as categorical columns with very high cardinality?

Do you use automatic detection of column types, or do you specify them manually?
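For completeness, this is roughly how H2O's assigned column types can be checked, or forced explicitly at import time instead of relying on automatic detection. A sketch only: the col.types grouping shown is illustrative, since the real columns are not ordered this way.

# Inspect the type H2O assigned to each column of the frame.
types <- unlist(h2o.getTypes(train.hex))  # 'enum', 'int', 'real', ...
table(types)

# Or force the types at import time rather than auto-detecting them.
train.hex <- h2o.importFile('training/train.txt',
                            destination_frame = 'train.hex',
                            col.types = c(rep('enum', 24),
                                          rep('numeric', 1389)))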
