
Increasing the size of the sample data - R

One of my colleagues indicated that randomForest() does not perform well with very large data sets. I am trying to find out whether that really is the case, but since the data set cannot be shared (sensitive information), I thought I might as well generate a large data set myself. I have tried the following, but cannot make sense of the error message:

library(randomForest)
data(iris)
dataFile <- iris
newdataFile <- dataFile[sample(dataFile, size= 1:1000000000, replace=T),]

Error message:

Error in xj[i] : invalid subscript type 'list'

Can anyone please guide me here?

sample accepts a vector. Passing the data frame itself makes sample() draw from its columns and return a data frame (a list), which is not a valid row subscript, hence the error; size should also be a single number, not 1:1000000000. When sampling from a data.frame, one usually samples the row numbers, much akin to subsetting, but in this case with replacement.

newdataFile <- iris[sample(nrow(iris), 100000, replace = TRUE), ]
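To check the colleague's claim on the enlarged data, something along these lines might do. It is only a rough sketch: the seed, the `n_rows` value, the `big_iris` name and `ntree = 50` are arbitrary choices for illustration, not anything from the original question. Keep in mind that rows resampled with replacement are duplicates, so the same observation can end up both in-bag and out-of-bag; the timing is informative, but the OOB error on such data will look better than it should.

```r
set.seed(42)                       # arbitrary seed, for reproducibility only
n_rows <- 100000                   # assumed target size; increase as memory allows

# Sample row indices (not the data frame itself) with replacement
idx <- sample(nrow(iris), n_rows, replace = TRUE)
big_iris <- iris[idx, ]
rownames(big_iris) <- NULL         # drop the duplicated row names

# Sanity check: same columns, requested number of rows
stopifnot(nrow(big_iris) == n_rows, ncol(big_iris) == ncol(iris))

# Time a small forest if the randomForest package is installed
if (requireNamespace("randomForest", quietly = TRUE)) {
  timing <- system.time(
    fit <- randomForest::randomForest(Species ~ ., data = big_iris, ntree = 50)
  )
  print(timing)
}
```

Raising `n_rows` step by step and watching the elapsed time gives a feel for where tractability, rather than accuracy, becomes the limit.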

The assertion that Random Forests does not perform well with large datasets is absurd. It is notably well suited to high dimensional problems, both from a sample size and a multivariate standpoint. The primary issues with RF and very large problems are: 1) tractability and 2) sample balance.

If you have a problem where one class is proportionally larger (>30%), then the bootstrap can be biased and the OOB validation, and possibly the estimate, is incorrect. For a binary problem with, say, [0=10000, 1=200], the model would predict class 0 at a very high rate and class 1 at a very low rate, resulting in a very good, but quite inflated, OOB error rate for the model overall and very poor performance for class 1.

This is obviously not representative of the model performance, and you will have very low prediction prevalence for class 1. If you have a class balance issue, I would follow the methodologies in either Chen et al. (2004) or Evans & Cushman (2009).
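As a rough illustration of the imbalance problem, the sketch below builds a synthetic [0=10000, 1=200] data set and compares a default forest with one that draws equal-sized samples from each class through randomForest's `strata` and `sampsize` arguments. The data, the predictors `x1` and `x2`, and `ntree = 200` are made up for the example. The stratified fit is only in the spirit of the balanced approach in Chen et al. (2004); it does not reproduce either cited method.

```r
set.seed(1)                        # arbitrary seed
n0 <- 10000                        # majority class size
n1 <- 200                          # minority class size

# Synthetic data: two weakly separated classes (purely illustrative)
dat <- data.frame(
  y  = factor(rep(c(0, 1), times = c(n0, n1))),
  x1 = c(rnorm(n0, mean = 0), rnorm(n1, mean = 1)),
  x2 = c(rnorm(n0, mean = 0), rnorm(n1, mean = 1))
)
class_counts <- table(dat$y)
print(class_counts)

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Default fit: each bootstrap sample is dominated by class 0
  rf_default <- randomForest::randomForest(y ~ ., data = dat, ntree = 200)

  # Stratified fit: each tree sees n1 cases drawn from each class
  rf_balanced <- randomForest::randomForest(
    y ~ ., data = dat, ntree = 200,
    strata = dat$y, sampsize = c(n1, n1)
  )

  # Compare per-class OOB error (the class.error column)
  print(rf_default$confusion)
  print(rf_balanced$confusion)
}
```

The `class.error` column of the confusion matrices shows the per-class picture that the overall OOB error rate hides.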

Chen C, Liaw A, Breiman L (2004) Using random forest to learn imbalanced data. http://www.stat.berkeley.edu/tech-reports/666.pdf

Evans JS, Cushman SA (2009) Gradient Modeling of Conifer Species Using Random Forests. Landscape Ecology 5:673-683.
