function importance for randomForest package

I wanted to use the Random Forest to find the most important features for a classification problem (I have two classes: 0 and 1).

I created the model:

rf = randomForest(y  ~ ., data = df, sampsize=100000,ntree=100, importance = TRUE, keep.forest = FALSE)

And then I used the following to check the importance:

importance(rf, type = 1, class = 1)

I read that the class parameter can be used for classification problems. My question is whether I have to sort the results by the absolute value of the mean decrease in accuracy. When I use varImpPlot, it seems that I should also consider the negative values. And what exactly does the parameter class = 1 do?

We can use the iris dataset, which has 3 species in it:

data(iris)
table(iris$Species)

setosa versicolor  virginica 
    50         50         50 

We fit a random forest:

library(randomForest)
mdl = randomForest(Species~.,data=iris,importance=TRUE)
# let's do it without options
importance(mdl)
                setosa versicolor virginica MeanDecreaseAccuracy
Sepal.Length  6.364533  6.2112640  7.632076            10.365371
Sepal.Width   4.790211  0.4339124  5.500338             5.153676
Petal.Length 22.027701 34.5777755 29.080648            35.215194
Petal.Width  22.500729 31.1403378 30.714576            33.335003
             MeanDecreaseGini
Sepal.Length         9.223319
Sepal.Width          2.189763
Petal.Length        44.703684
Petal.Width         43.163546

The table above contains all the results. If you call importance(mdl,type=1), you get the mean decrease in accuracy across all classes for each variable. There is also a separate column for each class you can predict (setosa, versicolor, virginica), so if you do:

importance(mdl,type=1,class="setosa")
                setosa
Sepal.Length  6.364533
Sepal.Width   4.790211
Petal.Length 22.027701
Petal.Width  22.500729

You get the mean decrease in accuracy associated with this class.
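To make the relationship between these calls concrete, here is a minimal sketch that checks each restricted call against the full table. The seed is an assumption added only so the run is repeatable; the numbers will differ from the output above.

```r
library(randomForest)

set.seed(42)  # assumption: fixed seed so the run is repeatable
mdl <- randomForest(Species ~ ., data = iris, importance = TRUE)

full    <- importance(mdl)                              # all columns
overall <- importance(mdl, type = 1)                    # MeanDecreaseAccuracy only
setosa  <- importance(mdl, type = 1, class = "setosa")  # one class column

# Each restricted call is a one-column slice of the full table.
all.equal(overall[, 1], full[, "MeanDecreaseAccuracy"])
all.equal(setosa[, 1],  full[, "setosa"])
```

Note that class only applies to type = 1; there is no per-class version of the Gini measure.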

So in your code, when you call importance(rf, type = 1, class = 1) on a model fitted with randomForest(y ~ ., data = df... ), you are asking for the importance of each variable for predicting the class labeled 1 in y.
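A small sketch of the two-class case may help. The data frame below is a hypothetical stand-in for the asker's df, built from iris, and it assumes y is stored as a factor (with a numeric y, randomForest would fit a regression instead). As far as I can tell, class is matched against the class column names, so the number 1 and the string "1" should select the same column.

```r
library(randomForest)

set.seed(1)  # assumption: fixed seed so the run is repeatable

# Hypothetical stand-in for the asker's data: a 0/1 factor response.
df <- iris[, 1:4]
df$y <- factor(as.integer(iris$Species == "virginica"))  # levels "0" and "1"

rf <- randomForest(y ~ ., data = df, ntree = 100, importance = TRUE)

# The class columns are named after the factor levels of y.
colnames(importance(rf))

# Ask for the class-specific measure by number and by label.
by_num <- importance(rf, type = 1, class = 1)
by_chr <- importance(rf, type = 1, class = "1")
identical(by_num, by_chr)
```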

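Lastly, you can sort them by their signed value rather than by absolute value (a negative value means that permuting the variable did not reduce accuracy, so the variable is not useful), like this: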

res = importance(mdl,type=1,class="setosa")
res = res[order(res[,1],decreasing=TRUE), ,drop=FALSE]
res
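The sorting step can also be tried on a model that includes a deliberately useless predictor, to see where low or negative scores end up. This is only a sketch: the noise column and the names dat, mdl2, imp and imp_sorted are made up for illustration, and whether the noise score actually comes out negative depends on the run.

```r
library(randomForest)

set.seed(7)  # assumption: fixed seed so the run is repeatable

dat <- iris
dat$noise <- rnorm(nrow(dat))  # hypothetical uninformative column

mdl2 <- randomForest(Species ~ ., data = dat, importance = TRUE)
imp  <- importance(mdl2, type = 1)  # overall mean decrease in accuracy

# Sort on the signed value, largest first; abs() is deliberately not used.
imp_sorted <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
imp_sorted

# Variables at or below zero did not help out-of-bag accuracy in this run.
rownames(imp_sorted)[imp_sorted[, 1] <= 0]

# varImpPlot draws the same measure as a sorted dot chart.
varImpPlot(mdl2, type = 1)
```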
