Function importance for randomForest package

I wanted to use a random forest to find the most important features for a classification problem (I have two classes: 0 and 1).

I created the model:

rf = randomForest(y  ~ ., data = df, sampsize=100000,ntree=100, importance = TRUE, keep.forest = FALSE)

And then I used the following to check the importance:

importance(rf, type = 1, class = 1)

I read that the class parameter can be used for a classification problem. My question is whether I have to sort the results by their absolute value in MeanDecreaseAccuracy. When I use varImpPlot, it seems that I should also consider the negative values. And what exactly does the parameter class = 1 do?

We can use the iris dataset, which has 3 species in it:

data(iris)
table(iris$Species)

setosa versicolor  virginica 
    50         50         50 

We fit a random forest:

library(randomForest)
mdl = randomForest(Species~.,data=iris,importance=TRUE)
# let's do it without options
importance(mdl)
                setosa versicolor virginica MeanDecreaseAccuracy
Sepal.Length  6.364533  6.2112640  7.632076            10.365371
Sepal.Width   4.790211  0.4339124  5.500338             5.153676
Petal.Length 22.027701 34.5777755 29.080648            35.215194
Petal.Width  22.500729 31.1403378 30.714576            33.335003
             MeanDecreaseGini
Sepal.Length         9.223319
Sepal.Width          2.189763
Petal.Length        44.703684
Petal.Width         43.163546

The table above contains all of the results. If you call importance(mdl, type = 1), you get the mean decrease in accuracy across all classes for each variable (the MeanDecreaseAccuracy column). The table also has one separate column for each class you can predict (setosa, versicolor, virginica), so if you do:

importance(mdl,type=1,class="setosa")
                setosa
Sepal.Length  6.364533
Sepal.Width   4.790211
Petal.Length 22.027701
Petal.Width  22.500729

You get the change in accuracy associated with this class only.

So in your code, when you call importance(rf, type = 1, class = 1) on the model randomForest(y ~ ., data = df... ), you are asking for the importance of each variable with respect to predicting the observations that have the label 1 in y.
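To make the two-class case concrete, here is a minimal sketch. It assumes the randomForest package is installed and builds a hypothetical 0/1 response from iris (1 = virginica, 0 = everything else), since the original df is not shown. The response is stored as a factor so that randomForest fits a classification forest, and the class is selected by its label.

```r
library(randomForest)

set.seed(42)  # results vary between runs without a fixed seed

# Hypothetical two-class data: y is 1 for virginica and 0 otherwise.
# y must be a factor, otherwise randomForest fits a regression forest.
df <- data.frame(iris[, 1:4],
                 y = factor(as.integer(iris$Species == "virginica")))

rf <- randomForest(y ~ ., data = df, importance = TRUE)

# Permutation importance (type = 1) restricted to the class labeled "1"
imp <- importance(rf, type = 1, class = "1")
imp
```

The result is a one-column matrix whose column is named after the requested class, with one row per predictor.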

Lastly, you can sort them like this:

res = importance(mdl,type=1,class="setosa")
res = res[order(res[,1],decreasing=TRUE),,drop=FALSE]
res
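On the question of absolute values: a negative mean decrease in accuracy means that permuting the variable did not hurt out-of-bag accuracy (it came out slightly better), which is usually read as the variable carrying no useful signal. Under that reading, sorting by the signed value, not the absolute value, seems the safer choice. The sketch below uses made-up numbers in the same shape as the output of importance(rf, type = 1, class = 1); the variable names x1 to x4 are hypothetical.

```r
# Made-up importance values, one row per (hypothetical) variable,
# one column named after the class, as importance() would return.
imp <- matrix(c(12.4, -1.8, 0.3, 7.9), ncol = 1,
              dimnames = list(c("x1", "x2", "x3", "x4"), "1"))

# Sort by signed value; drop = FALSE keeps the result a matrix
# so the row names (variable names) are preserved.
sorted <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
sorted

# Sorting by absolute value would rank x2 (-1.8) above x3 (0.3),
# even though x2 shows no evidence of being useful.
by_abs <- imp[order(abs(imp[, 1]), decreasing = TRUE), , drop = FALSE]
```

With the signed sort, the variable with the negative value ends up last.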
