
increasing memory in R

I'm working with a large data set (41,000 observations and 22 predictor variables) and trying to fit a Random Forest model using this code:

model <- randomForest(as.factor(data$usvsa) ~ ., ntree=1000, importance=TRUE, proximity=TRUE, data=data)

I am running into the following error:

Error: cannot allocate vector of size 12.7 Gb
In addition: Warning messages:
1: In matrix(0, n, n) :
  Reached total allocation of 6019Mb: see help(memory.size)
2: In matrix(0, n, n) :
  Reached total allocation of 6019Mb: see help(memory.size)
3: In matrix(0, n, n) :
  Reached total allocation of 6019Mb: see help(memory.size)
4: In matrix(0, n, n) :
  Reached total allocation of 6019Mb: see help(memory.size)

I have done some reading in the R help on memory limits and on this site, and I am thinking that I need to buy 12+ GB of RAM, since my memory limit is already set to about 6 GB (my computer only has 6 GB of RAM). But first I wanted to double-check that this is the only solution. I am running Windows 7 with a 64-bit processor and 6 GB of RAM. Here is the R sessionInfo:

sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] randomForest_4.6-7

loaded via a namespace (and not attached):
[1] tools_2.15.3

Any tips?

Quoting from the wonderful paper "Big Data: New Tricks for Econometrics" by Hal Varian:

If the extracted data is still inconveniently large, it is often possible to select a subsample for statistical analysis. At Google, for example, I have found that random samples on the order of 0.1 percent work for analysis of economic data.

So how about if you don't use all 41k rows and 22 predictors?
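A minimal sketch of that idea, using the same data frame and response as in the question (the 25% fraction is only an illustration; pick whatever fits in memory):

library(randomForest)

set.seed(42)                                    # reproducible subsample
idx <- sample(nrow(data), floor(0.25 * nrow(data)))
sub <- data[idx, ]

sub$usvsa <- as.factor(sub$usvsa)               # classification response, as in the question
model <- randomForest(usvsa ~ ., data = sub,
                      ntree = 1000, importance = TRUE)

Note that this sketch also leaves out proximity=TRUE; the matrix(0, n, n) in the warnings is the n-by-n proximity matrix, which is what needed the 12.7 Gb allocation.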

Yes, you simply need to buy more RAM. By default R will use all the memory available to it (at least on osx and linux).
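On the asker's Windows 7 setup, the cap that the warnings point to (see help(memory.size)) can be inspected and, after a RAM upgrade, raised. A minimal sketch; this cannot create memory the machine does not physically have:

# Windows-only: report the current per-session allocation cap, in MB
memory.limit()

# If the machine were upgraded to, say, 16 GB of RAM, the cap could be raised:
# memory.limit(size = 16000)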

The solution to your problem is actually pretty simple, and you don't have to sacrifice the quality of your analysis or invest in local RAM (which may still turn out to be insufficient). Simply make use of cloud computing services, such as Amazon's AWS or whichever provider you choose.

Basically, you rent a virtual machine with as much RAM as you need; it can be scaled up later, and I used a 64 GB RAM server at one point. Choose Linux, install R and the libraries, upload your data and scripts, and run your analysis. If it completes quickly, the whole procedure will not cost much (most likely under $10). Good luck!
