
Regression like quantification of the importance of variables in random forest

Is it possible to quantify the importance of variables in figuring out the probability of an observation falling into one class? Something similar to logistic regression.

For example, if I have the following independent variables:

  1. Number of cats the person has
  2. Number of dogs the person has
  3. Number of chickens the person has

With my dependent variable being: whether a person is a part of PETA or not.

Is it possible to say something like "if the person adopts one more cat on top of his existing animals, his probability of being a part of PETA increases by 0.12"?

I am currently using the following methodology for this particular scenario:

  1. Build a random forest model using the training data.
  2. Predict each customer's probability of falling into one particular class (PETA vs. non-PETA).
  3. Artificially increase the number of cats owned by each observation by 1.
  4. Predict each customer's new probability of falling into one of the two classes.
  5. The average change between (4)'s probability and (2)'s probability is the average increase in a person's probability if he adopts one more cat.
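In code, my methodology looks roughly like the sketch below. The data and column names (num_cats, num_dogs, num_chickens) are made up purely for illustration.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# toy stand-in data: animal counts and a made-up PETA label
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "num_cats": rng.integers(0, 5, 500),
    "num_dogs": rng.integers(0, 5, 500),
    "num_chickens": rng.integers(0, 10, 500),
})
y = (X["num_cats"] + X["num_dogs"] > 4).astype(int)

# step 1: build a random forest on the training data
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# step 2: predicted probability of the PETA class
base_prob = forest.predict_proba(X)[:, 1]

# steps 3-4: give every observation one more cat and re-predict
X_plus = X.copy()
X_plus["num_cats"] += 1
new_prob = forest.predict_proba(X_plus)[:, 1]

# step 5: average change in predicted probability per extra cat
print("average probability change:", (new_prob - base_prob).mean())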

Does this make sense? Is there any flaw in the methodology that I haven't thought of? Is there a better way of doing the same?

If you're using scikit-learn, you can easily do this by accessing the feature_importances_ property of the fitted RandomForestClassifier. According to the scikit-learn documentation:

The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features. By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection.

The feature_importances_ property stores each feature's importance score, averaged over the trees (in scikit-learn this is the mean decrease in impurity, not literally the average depth). Here's an example. Let's start by importing the necessary libraries.

# using this for some array manipulations
import numpy as np
# of course we're going to plot stuff!
import matplotlib.pyplot as plt

# dummy iris dataset
from sklearn.datasets import load_iris
#random forest classifier
from sklearn.ensemble import RandomForestClassifier

Once we have these, we're going to load the dummy dataset, define a classification model and fit the data to the model.

data = load_iris()

# we're gonna use 100 trees
forest = RandomForestClassifier(n_estimators=100)

# fit data to model by passing features and labels
forest.fit(data.data, data.target)

Now we can use the feature_importances_ property to get a score for each feature, based on how well it helps classify the data into the different targets.

# find importances of each feature
importances = forest.feature_importances_
# find the standard deviation across trees to assess the spread
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)

# find sorting indices of importances (descending)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(data.data.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

Feature ranking:
1. feature 2 (0.441183)
2. feature 3 (0.416197)
3. feature 0 (0.112287)
4. feature 1 (0.030334)

Now we can plot the importance of each feature as a bar graph and decide if it's worth keeping them all. We also plot the error bars to assess the significance.

plt.figure()
plt.title("Feature importances")
plt.bar(range(data.data.shape[1]), importances[indices],
       color="b", yerr=std[indices], align="center")
plt.xticks(range(data.data.shape[1]), indices)
plt.xlim([-1, data.data.shape[1]])
plt.show()

[Bar graph of feature importances]

I apologize. I didn't catch the part where you mention what kind of statement you're trying to make. I'm assuming your response variable is either 1 or 0. You could try something like this:

  1. Fit a linear regression model over the data. This won't give you the most accurate fit, but it will be robust enough to get the information you're looking for.
  2. Find the response of the model with the original inputs. (It most likely won't be ones or zeros.)
  3. Artificially change the inputs, and find the difference in the outputs between the original and the modified data, like you suggested in your question (see the sketch below).
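A rough sketch of those three steps, again with purely illustrative toy data and column names:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy stand-in data, mirroring the question's example
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "num_cats": rng.integers(0, 5, 500),
    "num_dogs": rng.integers(0, 5, 500),
    "num_chickens": rng.integers(0, 10, 500),
})
y = (X["num_cats"] + X["num_dogs"] > 4).astype(int)

lin = LinearRegression().fit(X, y)   # step 1
orig_out = lin.predict(X)            # step 2: not exactly ones/zeros

# step 3: one more cat per observation, then diff the outputs
X_plus = X.copy()
X_plus["num_cats"] += 1
delta = lin.predict(X_plus) - orig_out

# for a linear model this change is constant and equals the
# num_cats coefficient
print(delta.mean(), lin.coef_[0])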

Try it out with a logistic regression as well. It really depends on your data and how it's distributed which kind of regression will work best. You definitely have to use some form of regression to find the change in probability with a change in input.
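The logistic variant, continuing from the toy X and y in the previous sketch: the same perturbation works on predicted probabilities, and exp(coefficient) additionally has a direct odds-ratio reading.

from sklearn.linear_model import LogisticRegression

logit = LogisticRegression().fit(X, y)

X_plus = X.copy()
X_plus["num_cats"] += 1
delta = logit.predict_proba(X_plus)[:, 1] - logit.predict_proba(X)[:, 1]
print("average probability change:", delta.mean())

# exp(coef) is the multiplicative change in the odds per extra cat
print("odds ratio per cat:", np.exp(logit.coef_[0][0]))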

You can even try a single layer neural network with a regression/linear output layer to do the same thing. Add layers or non-linear activation functions if the data is less likely to be linear.
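A minimal sketch of that variant with scikit-learn's MLPRegressor (one hidden layer, identity output; the hidden layer size here is an arbitrary choice), still reusing the toy X and y from above:

from sklearn.neural_network import MLPRegressor

# one hidden layer, linear (identity) output unit
nn = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
nn.fit(X, y)

X_plus = X.copy()
X_plus["num_cats"] += 1
delta = nn.predict(X_plus) - nn.predict(X)
print("average output change:", delta.mean())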

Cheers!
