Understanding per-class variable importance in the R package "randomForest"

Question

I'm having trouble understanding the per-class columns returned by the importance function in randomForest.

My data set has two classes, "Current" and "Departed". To predict those classes, I first create a random forest model:

fit <- randomForest(IsDeparted ~ ..., df_train)

Then I run the importance function:

importance(fit)

The result has four columns of importance measures: "Current", "Departed", "MDA" and "GiniDecrease".

Could someone explain how to interpret the first two class columns? Is each one the mean decrease in accuracy when predicting that particular class after permuting the values of a given variable? And if so, does that mean I should focus on those columns rather than the MDA column when doing feature selection, if I am more interested in the model's performance for one particular class?

Answer 1

Yes, the first two columns are class-specific. Each is the mean decrease in accuracy for that class, scaled by its own standard error. If you are interested in the accuracy for one class, you can look at that column.
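A minimal sketch of reading a single class column, assuming the built-in iris data stands in for the asker's data (which is not shown) and picking "versicolor" arbitrarily as the class of interest:

```r
library(randomForest)

set.seed(111)
# importance = TRUE is needed to get the permutation-based, per-class columns
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

imp <- importance(fit)     # scaled by default
target <- "versicolor"     # hypothetical class of interest

# Rank variables by their importance for the chosen class only
ranked <- imp[order(imp[, target], decreasing = TRUE), target, drop = FALSE]
print(ranked)
```

Whether this ranking or the overall MDA ranking is the better basis for feature selection depends on which errors matter most for the application.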

Let's use an example, where importance() by default returns the scaled importance:

library(randomForest)
set.seed(111)
fit = randomForest(Species ~ .,data=iris,importance=TRUE)
importance(fit)

                setosa versicolor virginica MeanDecreaseAccuracy
Sepal.Length  6.716993  7.4654657  7.697842            10.869088
Sepal.Width   4.581990 -0.5208697  4.224459             3.772957
Petal.Length 22.155981 33.0549839 27.892363            33.272150
Petal.Width  22.497643 31.4966353 31.589361            33.123064
             MeanDecreaseGini
Sepal.Length         9.333510
Sepal.Width          2.425592
Petal.Length        43.324744
Petal.Width         44.146107
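To make the scaling concrete, here is a sketch that rebuilds the scaled accuracy columns by hand from the fitted object's importance and importanceSD components. It assumes none of the standard errors is zero, which should hold for iris but is not guaranteed in general:

```r
library(randomForest)

set.seed(111)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

raw <- fit$importance      # unscaled measures, including MeanDecreaseGini
se  <- fit$importanceSD    # "standard errors" for the accuracy-based columns only

# Divide each accuracy-based column by its own standard error
manual <- raw[, colnames(se)] / se

# Compare with the default (scaled) output for the same columns
print(all.equal(manual, importance(fit)[, colnames(se)]))
```

MeanDecreaseGini has no standard error attached, which is why it is identical in the scaled and unscaled outputs.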

If you look at the unscaled importance, importance(fit, scale=FALSE), you can see that the MDA (MeanDecreaseAccuracy) column is roughly the average of the three class columns. That holds here because the three classes are balanced; with imbalanced classes it will be different. The unscaled values are:

                  setosa   versicolor   virginica MeanDecreaseAccuracy
Sepal.Length 0.034156211  0.021093423 0.036147901          0.030810465
Sepal.Width  0.006522917 -0.001117593 0.006937731          0.004273138
Petal.Length 0.329299111  0.301621639 0.296869242          0.305569113
Petal.Width  0.335363736  0.298729184 0.279526019          0.302855284
             MeanDecreaseGini
Sepal.Length         9.333510
Sepal.Width          2.425592
Petal.Length        43.324744
Petal.Width         44.146107
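The "roughly the average" observation can be checked directly. The sketch below puts the row-wise mean of the three class columns next to the overall column; the names class_mean and overall are only illustrative labels:

```r
library(randomForest)

set.seed(111)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

imp <- importance(fit, scale = FALSE)   # unscaled permutation importance
classes <- levels(iris$Species)

# Mean of the per-class columns next to the overall MeanDecreaseAccuracy
comparison <- cbind(
  class_mean = rowMeans(imp[, classes]),
  overall    = imp[, "MeanDecreaseAccuracy"]
)
print(round(comparison, 4))
```

On a data set with imbalanced classes the two columns would be expected to drift apart, since the overall measure is dominated by the larger classes.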
