Stratified sampling for random forests in R

Question

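I read the following in the documentation of randomForest:

strata: A (factor) variable that is used for stratified sampling.

sampsize: Size(s) of sample to draw. For classification, if sampsize is a vector whose length equals the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.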
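As far as the quoted description goes, a vector-valued sampsize seems to mean "draw this many rows from each stratum". The following base-R sketch only illustrates that reading; it is not randomForest's internal code, and the strata labels and sizes are made up for the illustration.

```r
# Base-R sketch of per-stratum sampling (illustrative only).
set.seed(1)

# Hypothetical strata: 50 rows of "a", 30 of "b", 20 of "c".
strata   <- factor(rep(c("a", "b", "c"), times = c(50, 30, 20)))
# One entry per stratum, in the order of levels(strata).
sampsize <- c(10, 6, 4)

# Row indices belonging to each stratum.
rows_by_stratum <- split(seq_along(strata), strata)

# Draw sampsize[i] rows, with replacement, from stratum i.
drawn <- unlist(Map(
  function(rows, n) rows[sample.int(length(rows), n, replace = TRUE)],
  rows_by_stratum, sampsize
))

# Counts per stratum in the drawn sample.
table(strata[drawn])
```

Under this reading, the per-stratum counts in the drawn sample match sampsize entry by entry.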

For reference, the interface to the function is given by:

 randomForest(x, y=NULL,  xtest=NULL, ytest=NULL, ntree=500,
              mtry=if (!is.null(y) && !is.factor(y))
              max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
              replace=TRUE, classwt=NULL, cutoff, strata,
              sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
              nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
              maxnodes = NULL,
              importance=FALSE, localImp=FALSE, nPerm=1,
              proximity, oob.prox=proximity,
              norm.votes=TRUE, do.trace=FALSE,
              keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
              keep.inbag=FALSE, ...)

My question is: how would one use strata and sampsize? Here is a minimal working example on which I would like to test these parameters:

library(randomForest)
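# stringsAsFactors = TRUE keeps iris.type a factor (not the default in R >= 4.0),
# which randomForest needs in order to do classification.
iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                  sep = ",", header = FALSE, stringsAsFactors = TRUE)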
names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width", "iris.type")

model = randomForest(iris.type ~ sepal.length + sepal.width, data = iris)

> model
500 samples
  6 predictors
  2 classes: 'Y0', 'Y1' 

No pre-processing
Resampling: Bootstrap (7 reps) 

Summary of sample sizes: 477, 477, 477, 477, 477, 477, ... 

Resampling results across tuning parameters:

  mtry  ROC    Sens  Spec  ROC SD  Sens SD  Spec SD
  2     0.763  1     0     0.156   0        0      
  4     0.782  1     0     0.231   0        0      
  6     0.847  1     0     0.173   0        0      

ROC was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 6.

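I am asking about these parameters because I would like RF to use bootstrap samples that respect the proportion of positives to negatives in my data.

Another thread started a discussion on this topic, but it was settled without clarifying how to use these parameters.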

Answer 1

Wouldn't this just be something like:

model = randomForest(iris.type ~ sepal.length + sepal.width, 
                     data = iris, 
                     sampsize=c(10,10,10), strata=iris$iris.type)

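I did try ..., strata=iris.type and ..., strata='iris.type', but apparently the code is not written to look that value up in the environment of the 'data' argument, so the column has to be passed explicitly as iris$iris.type. I used the outcome variable because it is the only factor variable in that data set, but I do not think it needs to be the outcome variable. In fact, I think it definitely should not be the outcome variable. This particular model is expected to produce useless output and is presented only to test the syntax.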
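To get samples that respect the class proportions, as the question asks, one option is to make sampsize proportional to the class counts. The sketch below is only an illustration under a few assumptions: it uses the built-in datasets::iris (whose column names differ from those in the question), the 20% fraction is an arbitrary choice, and replace = FALSE is set only so that the in-bag rows can be counted directly from the inbag matrix returned with keep.inbag = TRUE.

```r
library(randomForest)

set.seed(42)
d <- datasets::iris  # built-in copy; columns are Sepal.Length, ..., Species

# Hypothetical choice: draw about 20% of each class, preserving class proportions.
class_counts <- table(d$Species)
per_class    <- as.vector(round(0.2 * class_counts))  # ordered like levels(d$Species)

model <- randomForest(Species ~ Sepal.Length + Sepal.Width,
                      data       = d,
                      strata     = d$Species,  # passed explicitly, not looked up in 'data'
                      sampsize   = per_class,
                      replace    = FALSE,      # without replacement, so in-bag rows are simply 0/1
                      keep.inbag = TRUE)

# Number of in-bag rows per class for the first tree.
inbag_per_class <- tapply(model$inbag[, 1] > 0, d$Species, sum)
inbag_per_class
```

If the stratification works as documented, the per-class in-bag counts for each tree should equal per_class. Dropping replace = FALSE returns to the default bootstrap sampling with replacement, still drawn per stratum.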

Solution 1
7, accepted, 2013-02-12 21:43:06
