Random forest variable importance and direction of association for a binomial response

Question

I am using the randomForest package in R, but am not partial to solutions using other packages. My RF model uses various continuous and categorical variables to predict extinction risk (Threatened, Non_Threatened). I would like to show the direction of variable importance for the predictors in my RF model. Other publications have done exactly this: see Figure 1 in https://www.pnas.org/content/pnas/109/9/3395.full.pdf

Any ideas on how to do something similar? One suggestion I read said to simply compare the difference between two partial dependence plots (example below), but I feel this may not be the best way. Any help would be greatly appreciated.

partialPlot(final_rf, rf_train, size_mat,"Threatened")
partialPlot(final_rf, rf_train, size_mat,"Non_Threatened")

[Figure: partial dependence plot, response = Threatened]

[Figure: partial dependence plot, response = Non_Threatened]

Answer 1

You could use something like an average marginal effect (or, as below, an average first difference) approach.

First, I'll make some data:

set.seed(11)
n <- 200
p <- 5
X <- data.frame(matrix(runif(n * p), ncol = p))
yhat <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - .5)^2 +
  10 * -X[, 4] + 5 * -X[, 5]
y <- as.numeric((yhat + rnorm(n)) > mean(yhat))
df <- as.data.frame(cbind(X, y))

Next, we'll estimate the RF model:

library(randomForest)
rf <- randomForest(as.factor(y) ~ ., data=df)
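The first differences computed further down give a sign but not the forest's own measure of importance. A minimal sketch of how a magnitude could be obtained alongside them, assuming the model is refit with importance = TRUE (the fit above does not request it). The data lines repeat the simulation above so the block runs on its own, and rf_imp and imp are illustrative names.

```r
library(randomForest)

# Same simulated data as above, repeated so this block is self-contained
set.seed(11)
n <- 200
p <- 5
X <- data.frame(matrix(runif(n * p), ncol = p))
yhat <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - .5)^2 +
  10 * -X[, 4] + 5 * -X[, 5]
y <- as.numeric((yhat + rnorm(n)) > mean(yhat))
df <- as.data.frame(cbind(X, y))

# Assumption: refit with importance = TRUE to store permutation importance
rf_imp <- randomForest(as.factor(y) ~ ., data = df, importance = TRUE)

# type = 1 is the mean decrease in accuracy; scale = FALSE keeps raw values
imp <- importance(rf_imp, type = 1, scale = FALSE)

# Sort predictors from most to least important
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```

Permutation importance says how much a predictor matters, not in which direction it pushes the prediction, so it complements the signed effects rather than replacing them.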

Next, we loop through each variable. On each pass through the loop, we add one standard deviation to a single x variable for all observations. In your application, you could also switch categorical variables from one category to another. We then predict the probability of a positive response under both conditions: the original data, and the data with one standard deviation added to that variable. Finally, we take the difference and summarize it.

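# Predictor names (everything except the response)
nx <- setdiff(names(df), "y")
# Baseline predicted probabilities; these are the same on every iteration
p1 <- predict(rf, newdata = df, type = "prob")
res <- NULL
for (v in nx) {
  df2 <- df
  # Shift this one predictor up by one standard deviation
  df2[[v]] <- df2[[v]] + sd(df2[[v]])
  p2 <- predict(rf, newdata = df2, type = "prob")
  # Change in P(y = 1); column 2 is the positive class
  d <- (p2 - p1)[, 2]
  res <- rbind(res, c(mean(d), sd(d)))
}
colnames(res) <- c("effect", "sd")
rownames(res) <- nx
res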
#       effect         sd
# X1  0.11079 0.18491252
# X2  0.10265 0.16552070
# X3  0.02015 0.07951409
# X4 -0.11687 0.16671916
# X5 -0.04704 0.10274836
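
For a categorical predictor there is no standard deviation to add, so the analogous first difference sets every observation to one level and then to another. A sketch under stated assumptions: the data frame dat, the factor habitat, its levels, the simulated effect, and the helper set_level are all hypothetical and exist only to make the block runnable.

```r
library(randomForest)

set.seed(11)
n <- 300
# Hypothetical data: one continuous and one categorical predictor
dat <- data.frame(
  size    = runif(n),
  habitat = factor(sample(c("forest", "marine", "desert"), n, replace = TRUE))
)
# Simulated truth (an assumption): "marine" raises P(y = 1)
lin <- 2 * dat$size + 1.5 * (dat$habitat == "marine") - 1.5
dat$y <- factor(rbinom(n, 1, plogis(lin)))

rf_cat <- randomForest(y ~ ., data = dat)

# Set every observation to `level`, keeping the original factor levels
# so that predict() sees the same levels as the training data
set_level <- function(data, var, level) {
  data[[var]] <- factor(rep(level, nrow(data)), levels = levels(data[[var]]))
  data
}

# Average first difference in P(y = 1) between each level and a reference
ref   <- "forest"
lvls  <- setdiff(levels(dat$habitat), ref)
p_ref <- predict(rf_cat, newdata = set_level(dat, "habitat", ref),
                 type = "prob")[, "1"]
res_cat <- t(sapply(lvls, function(l) {
  p_l <- predict(rf_cat, newdata = set_level(dat, "habitat", l),
                 type = "prob")[, "1"]
  d <- p_l - p_ref
  c(effect = mean(d), sd = sd(d))
}))
res_cat
```

Each row is one non-reference level. A positive effect means that moving observations from the reference level to that level raises the predicted probability of the positive class on average.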

Answer 1 · 2 · Accepted · 2020-10-16 12:20:37
