
Multivariate analysis on random forest results

Apologies in advance for the lack of a data sample:

I built a random forest of 128 trees with no tuning, with 1 binary outcome and 4 continuous explanatory variables. I then compared the AUC of this forest against that of a forest that had already been built and was predicting on the same cases. What I want to figure out is what exactly is lending predictive power to this new forest. Univariate analysis against the outcome variable led to no significant findings. Any technique recommendations would be greatly appreciated.
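For reference, a minimal sketch of the setup described above, assuming a data frame `df` with a binary factor outcome `y` and four continuous predictors `x1` through `x4` (all names hypothetical):

```r
library(randomForest)
library(pROC)

set.seed(42)
# 128 trees, no tuning; importance = TRUE stores variable importance
# measures for later inspection
rf <- randomForest(y ~ x1 + x2 + x3 + x4, data = df,
                   ntree = 128, importance = TRUE)

# Out-of-bag predicted probabilities for the positive class
oob_probs <- predict(rf, type = "prob")[, 2]
auc(roc(df$y, oob_probs))
```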

EDIT: To summarize, I want to perform multivariate analysis on these 4 explanatory variables to identify what interactions are taking place that may explain the forest's predictive power.

Random forest is what's known as a "black box" learning algorithm, because there is no direct way to interpret the relationship between the input and outcome variables. You can, however, use tools such as the variable importance plot or the partial dependence plot to get a sense of which variables contribute the most to the predictions.

Here are some discussions on variable importance plots, also here and here. It is implemented in the randomForest package as varImpPlot() and in the caret package as varImp(). The interpretation of the plot depends on the metric you use to assess variable importance. For example, if you use MeanDecreaseAccuracy, a high value for a variable means that, on average, randomly permuting that variable's values degrades the forest's classification accuracy by a good amount.
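As a brief illustration, assuming a forest `rf` fit with importance = TRUE (as in the sketch in the question above):

```r
library(randomForest)

# Dot chart of both importance measures: permutation-based
# MeanDecreaseAccuracy and impurity-based MeanDecreaseGini
varImpPlot(rf)

# Numeric permutation importance only (type = 1 selects MeanDecreaseAccuracy)
importance(rf, type = 1)
```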

Here are some other discussions on partial dependence plots for predictive models, and also here. Partial dependence is implemented in the randomForest package as partialPlot().
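For example, a minimal partialPlot() call, again assuming the forest `rf` and the hypothetical data frame `df` from the sketches above:

```r
library(randomForest)

# Marginal effect of x1 on the predicted probability of the second
# outcome class, averaged over the other predictors in df
partialPlot(rf, pred.data = df, x.var = "x1",
            which.class = levels(df$y)[2])
```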

In practice, 4 explanatory variables is not many, so you can easily run a binary logistic regression (possibly with L2 regularization) to get a more interpretable model, and compare its performance against the random forest. See this discussion about variable selection. It is implemented in the glmnet package. Basically, L2 regularization, also known as ridge, is a penalty term added to your loss function that shrinks your coefficients to reduce variance at the expense of increased bias. This effectively reduces prediction error whenever the reduction in variance more than compensates for the added bias (which is often the case). Since you only have 4 input variables, I suggest L2 instead of L1 (also known as lasso, which additionally performs automatic feature selection). See this answer for tuning the ridge and lasso shrinkage parameter with cv.glmnet: How to estimate shrinkage parameter in Lasso or ridge regression with >50K variables?
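A minimal ridge logistic regression sketch with cv.glmnet, reusing the hypothetical `df` from above (glmnet expects a numeric predictor matrix):

```r
library(glmnet)

x <- as.matrix(df[, c("x1", "x2", "x3", "x4")])

# alpha = 0 gives the ridge (L2) penalty; lambda is chosen by
# cross-validation
cvfit <- cv.glmnet(x, df$y, family = "binomial", alpha = 0)

# Shrunken coefficients at the lambda minimizing CV error
coef(cvfit, s = "lambda.min")

# Predicted probabilities, directly comparable to the forest's AUC
ridge_probs <- predict(cvfit, newx = x, s = "lambda.min",
                       type = "response")
```

Since the EDIT asks specifically about interactions, one option is to also fit a plain logistic model with pairwise interaction terms, e.g. glm(y ~ (x1 + x2 + x3 + x4)^2, data = df, family = binomial), and check whether any interaction coefficients account for the predictive power the forest found.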
