
Using combine() with the R Package randomForest

Original post: 2015-09-19 15:24:12. Tags: r, random-forest

I'm working with a very large set of data, about 120,000 rows and 34 columns. As you can well imagine, when using the R package randomForest, the program takes quite a number of hours to run, even on a powerful Windows server.

Although I am no expert in randomForest, I have a question about the proper use of the combine() function.

I got conflicting answers when I researched this question online. Some say that you can only use combine() on forests that were grown with randomForest on the same set of data. Others say that you can use combine() without that restriction.

What I'd like (hope, dream) to do is break up the 120,000 rows of data into 6 data frames, each containing 20,000 rows, and run randomForest on each of the 6 data frames. My hope is that I can then use the combine() function to merge the results of all 6 together. Is that possible?
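For illustration, the plan described above might look roughly like the sketch below. It is not from the original thread: `iris` stands in for the real data, 3 chunks stand in for the 6, and the names `n_chunks`, `chunk_id`, `chunks`, `forests` and `rf_all` are made up for the example. It assumes every chunk has the same number of rows and the same predictor columns and factor levels. Note that the out-of-bag components of the merged object (votes, predicted classes) are not meaningful when the forests were grown on different rows, and the package documentation states that the confusion matrix and error rates of a combined forest are NULL, so accuracy would have to be checked on held-out data.

```r
library(randomForest)

set.seed(42)
data(iris)  # stand-in for the 120,000-row data set (assumption)

# Randomly assign each row to one of n_chunks equally sized chunks.
n_chunks <- 3  # the question uses 6
chunk_id <- sample(rep(seq_len(n_chunks), length.out = nrow(iris)))
chunks <- split(iris, chunk_id)

# Grow a smaller forest on each chunk. norm.votes = FALSE keeps raw vote
# counts, as in the example in the combine() help page.
forests <- lapply(chunks, function(d) {
  randomForest(Species ~ ., data = d, ntree = 50, norm.votes = FALSE)
})

# Merge the per-chunk forests into a single ensemble. The namespace prefix
# avoids clashes with combine() functions from other packages.
rf_all <- do.call(randomForest::combine, unname(forests))

# The merged forest can be used for prediction like any other.
pred <- predict(rf_all, newdata = iris)
```

Each tree in the merged forest has only seen one chunk of the data, so the result is not equivalent to a single forest grown on all rows.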

Any help in this matter would be greatly appreciated.

1 Answer

A couple of hours seems like a lot of time. Are you sure you are running on an optimized machine? Perhaps you could experiment on Linux and AWS EC2. Also check out ranger, which has been out for a couple of weeks: http://arxiv.org/abs/1508.04409 and https://cran.r-project.org/web/packages/ranger/index.html
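A minimal sketch of trying ranger, assuming the package is installed. `iris` is again only a placeholder data set, and the values chosen for `num.trees` and `num.threads` are arbitrary and would need tuning for the real data:

```r
library(ranger)

set.seed(42)
data(iris)  # placeholder for the real data (assumption)

# Fit a classification forest; num.threads controls how many cores are used.
fit <- ranger(Species ~ ., data = iris, num.trees = 500, num.threads = 2)

# Out-of-bag prediction error estimated during training.
oob_error <- fit$prediction.error

# Predicted classes for new data (here the training data, for brevity).
pred <- predict(fit, data = iris)$predictions
```

Because ranger grows trees in parallel on the full data set, it may make splitting the rows by hand unnecessary.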