Increasing the size of sample data - R

Question

One of my colleagues indicated that randomForest() does not perform well with very large data sets. I am trying to find out whether that really is the case, but since the data set cannot be shared (sensitive information), I thought I might as well try to construct a large data set myself. I have tried the following, but cannot make sense of the error message:

library(randomForest)
data(iris)
dataFile <- iris
newdataFile <- dataFile[sample(dataFile, size= 1:1000000000, replace=T),]

Error message:

Error in xj[i] : invalid subscript type 'list'

Can anyone please guide me here?

Answer 1

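sample() accepts a vector. When sampling from a data.frame, one usually samples the row numbers and uses them to index the rows, much like ordinary subsetting but, in this case, with replacement. In the original call, sample() was handed the data.frame itself (a list of columns), so it returned columns rather than row numbers, and a list cannot be used as a row subscript. The size argument should also be a single number rather than the vector 1:1000000000.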

newdataFile <- iris[sample(nrow(iris), 100000, replace = TRUE), ]
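
To make the fix concrete, here is a minimal sketch of the whole step in base R. The names `n_target`, `idx` and `big_iris`, the target size and the seed are arbitrary choices for illustration, not values from the question.

```r
# Inflate a small data set by resampling its rows with replacement.
data(iris)
set.seed(42)            # arbitrary seed, only for reproducibility
n_target <- 100000      # hypothetical target size

# sample() draws row indices (integers), not rows of the data frame itself
idx <- sample(nrow(iris), size = n_target, replace = TRUE)
big_iris <- iris[idx, ]

dim(big_iris)           # n_target rows, same columns as iris
```

Note that every row of the inflated data set is a duplicate of one of the 150 original rows, so it is useful for checking run time and memory use, not for judging predictive performance.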

Answer 2

The assertion that Random Forests do not perform well with large data sets is absurd. The method is notably well suited to high-dimensional problems, in terms of both sample size and number of variables. The primary issues with RF on very large problems are: 1) tractability and 2) sample balance.

If you have a problem where one class is proportionally larger (>30%), then the bootstrap can be biased and the OOB validation, and possibly the estimate, is incorrect. For a binary problem with, say, [0=10000, 1=200], the result would be a very high prediction rate for class 0 and a very low one for class 1, giving a very good, but quite inflated, OOB error rate for the model overall and very poor performance for class 1.
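
The arithmetic behind that inflated error rate can be sketched in base R. The class counts are the ones from the example above; the "model" is a hypothetical degenerate one that always predicts the majority class, used only to show the extreme case.

```r
# Class counts from the example: 10000 cases of class 0, 200 of class 1
truth <- factor(rep(c(0, 1), times = c(10000, 200)))

# Hypothetical degenerate model: always predicts the majority class
pred <- factor(rep(0, length(truth)), levels = levels(truth))

overall_error <- mean(pred != truth)                 # 200 / 10200
class_error   <- tapply(pred != truth, truth, mean)  # error rate per true class

overall_error
class_error
```

The overall error is under 2%, yet every class 1 case is misclassified, which is why per-class error rates should be read alongside the overall figure.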

This is obviously not representative of the model's performance, and you will have very low prediction prevalence for class 1. If you have a class balance issue, I would follow the methodology in either Chen et al. (2004) or Evans & Cushman (2009).
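
One way to try a balanced forest with the randomForest package is through its `strata` and `sampsize` arguments, which make each tree draw a fixed number of cases per class. The sketch below uses simulated data (the variables `a` and `b`, the class sizes and the seed are all made up for illustration). It approximates the down-sampling idea and is not claimed to reproduce the exact procedure of either cited paper.

```r
library(randomForest)
set.seed(1)   # arbitrary seed, only for reproducibility

# Simulated imbalanced data (hypothetical): 2000 of class 0, 100 of class 1
n0 <- 2000
n1 <- 100
x <- data.frame(
  a = c(rnorm(n0, mean = 0), rnorm(n1, mean = 1.5)),
  b = c(rnorm(n0, mean = 0), rnorm(n1, mean = 1.5))
)
y <- factor(rep(c(0, 1), times = c(n0, n1)))

# Default forest: each tree's bootstrap sample mirrors the imbalance
rf_default <- randomForest(x, y, ntree = 200)

# Balanced forest: each tree draws n1 cases from each class
rf_balanced <- randomForest(x, y, ntree = 200,
                            strata = y, sampsize = c(n1, n1))

# OOB confusion matrices; the class.error column gives per-class error
rf_default$confusion
rf_balanced$confusion
```

Comparing the `class.error` column of the two confusion matrices shows how the per-class OOB error changes once the trees see balanced samples.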

Chen C, Liaw A, Breiman L (2004) Using random forest to learn imbalanced data. http://www.stat.berkeley.edu/tech-reports/666.pdf

Evans JS, Cushman SA (2009) Gradient modeling of conifer species using random forests. Landscape Ecology 5:673-683.

2 answers

Answer 1: 2, accepted, 2012-10-22 17:50:42

Answer 2: 2, 2012-10-22 20:12:04
