
Merging Tree Models from two random forest models into one random forest model at H2O in R

Original post 2018-04-16 21:17:04. Tags: r, machine-learning, parallel-processing, h2o

I am relatively new to machine learning, so please excuse me if some of my questions are really basic.

Current situation: The overall goal is to improve some code that uses the h2o package in R, running on a supercomputer cluster. However, the data is so large that a single node running h2o takes more than a day, so we have decided to use multiple nodes to run the model. I came up with an idea:

(1) Assign each node (nTree/num_node) trees to build and save into a model;

(2) Run on the cluster, with each node growing its (nTree/num_node) trees of the forest;

(3) Merge the trees back together to re-form the original forest, and average the measurement results.

I later realized this could be risky, but I cannot find a statement that actually supports or argues against it, since I am not a machine-learning-focused programmer.

Questions:

If this way of handling random forest carries some risk, please point me to a link so I can get a basic idea of why it is not right.
If this way is actually an "ok" way to do it, what should I do to merge the trees? Is there a package or method I can borrow from?
If this is actually a solved problem, please point me to the link; I may have searched the wrong keywords. Thank you!

A concrete example with real numbers:

I have a random forest task with 80k rows and 2k columns, and I want 64 trees. What I have done is put 16 trees on each node, each running with the whole dataset, so each of the four nodes comes up with an RF model. I am now trying to merge the trees from each model into one big RF model and average the measurements (from each of those four models).

2 Answers

There is no need to merge the models. Unlike with boosting methods, every tree in a Random Forest is grown independently (just don't set the same seed prior to kicking off RF on each node!).
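To make the seed caveat concrete, here is a minimal sketch in plain Python (not h2o code) of how the per-node jobs could be planned. The helper name `plan_node_jobs`, the `base_seed` value, and the dictionary layout are all hypothetical; the only point is that each node gets an equal share of the trees and its own seed, which would then be passed to whatever RF training call is used on that node.

```python
def plan_node_jobs(n_trees, n_nodes, base_seed=1234):
    """Split n_trees evenly across n_nodes and give every node a distinct seed.

    Hypothetical helper for illustration only; it does not train anything.
    """
    if n_trees % n_nodes != 0:
        raise ValueError("n_trees must divide evenly across the nodes")
    per_node = n_trees // n_nodes
    # Offsetting the base seed by the node index keeps the seeds distinct,
    # so the nodes do not all grow the same 16 trees.
    return [
        {"node": i, "ntrees": per_node, "seed": base_seed + i}
        for i in range(n_nodes)
    ]


# The setup from the question: 64 trees over 4 nodes.
jobs = plan_node_jobs(64, 4)
for job in jobs:
    print(job)
```

If every node used the same seed on the same data, the four sub-forests could come out identical, and averaging them would add nothing over a single 16-tree forest.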

You are basically doing what Random Forest does on its own, which is to grow X independent trees and then average across the votes. Many packages provide an option to specify the number of cores or threads, in order to take advantage of this feature of RF.
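For the numbers in the question, the equivalence can be written out as follows. This assumes the forest's prediction is the plain average of the per-tree outputs $f_t(x)$ (regression values or class probabilities), and that $T_1, \dots, T_4$ denote the four disjoint sets of 16 trees grown on the four nodes; both the notation and the plain-average assumption are mine, not something stated in the thread.

```latex
\hat{f}(x)
  = \frac{1}{64} \sum_{t=1}^{64} f_t(x)
  = \frac{1}{4} \sum_{k=1}^{4} \left( \frac{1}{16} \sum_{t \in T_k} f_t(x) \right)
```

The equality holds because every node contributes the same number of trees. With unequal tree counts, each sub-forest's average would need to be weighted by its share of the trees.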

In your case, since you have the same number of trees per node, you'll get 4 "models" back, but those are really just collections of 16 trees. To use them, I'd just keep the 4 models separate and, when you want a prediction, average the predictions from the 4 models. Assuming you're going to be doing that more than once, you could write a small wrapper function that predicts with the 4 models and averages the output.
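A wrapper along those lines might look like the sketch below. It is plain Python rather than h2o code: each "model" is assumed to be a callable that takes a list of rows and returns one numeric prediction per row, and the name `predict_ensemble`, the optional `weights` argument, and the stand-in models are all hypothetical. With h2o in R, each callable would correspond to a prediction call (for example `h2o.predict`) on one of the four saved models.

```python
def predict_ensemble(models, rows, weights=None):
    """Average per-row predictions from several sub-forests.

    models  : list of callables, each returning one number per input row
    weights : optional per-model weights (e.g. tree counts); equal if omitted
    """
    if weights is None:
        weights = [1] * len(models)
    total = sum(weights)
    # One list of predictions per model, all in the same row order.
    per_model = [model(rows) for model in models]
    # zip(*per_model) regroups the predictions row by row.
    return [
        sum(w * p for w, p in zip(weights, preds)) / total
        for preds in zip(*per_model)
    ]


# Stand-ins for the four 16-tree models: each adds a different offset.
models = [lambda rows, b=b: [x + b for x in rows] for b in (0.0, 1.0, 2.0, 3.0)]
averaged = predict_ensemble(models, [10.0, 20.0])
print(averaged)  # [11.5, 21.5]
```

Equal weights are the right default here because each node grew 16 trees. If the nodes grew different numbers of trees, passing the tree counts as `weights` would reproduce the average over all trees.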

10,000 rows by 1,000 columns is not overly large and should not take that long to train an RF model.

It sounds like something unexpected is happening.

While you can try to average models if you know what you are doing, I don't think it should be necessary in this case.