
Regression-like quantification of variable importance in a random forest

Is it possible to quantify the importance of variables in figuring out the probability of an observation falling into one class, similar to what you get from logistic regression?

For example, suppose I have the following independent variables:

1. Number of cats the person has
2. Number of dogs the person has
3. Number of chickens the person has

With my dependent variable being: whether the person is a member of PETA or not.

Is it possible to say something like "if the person adopts one more cat on top of the animals he already owns, his probability of being a PETA member increases by 0.12"?

I am currently using the following methodology for this scenario:

1. Build a random forest model on the training data.
2. Predict each observation's probability of falling into one particular class (PETA vs. non-PETA).
3. Artificially increase the number of cats owned by each observation by 1.
4. Predict each observation's new probability of falling into one of the two classes.
5. The average difference between the probabilities from (4) and (2) is the average increase in a person's probability if he adopts a cat.
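In code, the five steps above might look like the following minimal sketch. The data here is synthetic: the cat/dog/chicken counts and the PETA-style label are made up purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# synthetic features: counts of cats, dogs, chickens per person
X = rng.poisson(lam=[2, 1, 3], size=(500, 3)).astype(float)
# synthetic label: more cats -> more likely to be a "member"
y = (X[:, 0] + rng.normal(0, 1, 500) > 2).astype(int)

# step 1: build a random forest on the training data
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# step 2: baseline probability of the positive class
p_base = forest.predict_proba(X)[:, 1]

# step 3: add one cat to every observation
X_plus = X.copy()
X_plus[:, 0] += 1

# step 4: re-predict on the modified data
p_plus = forest.predict_proba(X_plus)[:, 1]

# step 5: average change in predicted probability
print("average change in probability:", (p_plus - p_base).mean())
```

Note that with a random forest this average change is specific to where your data sits: the forest is piecewise constant, so the effect of "+1 cat" can differ a lot between observations.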

Does this make sense? Is there any flaw in the methodology that I haven't thought of? Is there a better way of doing the same?

If you're using scikit-learn, you can easily do this by accessing the feature_importances_ attribute of the fitted RandomForestClassifier. According to the scikit-learn documentation:

The relative rank (ie depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features. By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection.

The feature_importances_ attribute stores each feature's importance, computed from the impurity reduction it provides at its split nodes and averaged over the trees (the "mean decrease in impurity"). Here's an example. Let's start by importing the necessary libraries.

# using this for some array manipulations
import numpy as np
# of course we're going to plot stuff!
import matplotlib.pyplot as plt

# dummy iris dataset
from sklearn.datasets import load_iris
#random forest classifier
from sklearn.ensemble import RandomForestClassifier

Once we have these, we're going to load the dummy dataset, define a classification model and fit the data to the model.

data = load_iris()

# we're gonna use 100 trees
forest = RandomForestClassifier(n_estimators=100)

# fit data to model by passing features and labels
forest.fit(data.data, data.target)

Now we can use the feature_importances_ attribute to get a score for each feature, based on how much it contributes to classifying the data into the different targets.

# find importances of each feature
importances = forest.feature_importances_
# find the standard deviation across trees to assess the spread
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)

# find sorting indices of importances (descending)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(data.data.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

Feature ranking:
1. feature 2 (0.441183)
2. feature 3 (0.416197)
3. feature 0 (0.112287)
4. feature 1 (0.030334)

Now we can plot the importance of each feature as a bar graph and decide whether it's worth keeping them all. We also add error bars (one standard deviation across trees) to assess the spread of each estimate.

plt.figure()
plt.title("Feature importances")
plt.bar(range(data.data.shape[1]), importances[indices],
       color="b", yerr=std[indices], align="center")
plt.xticks(range(data.data.shape[1]), indices)
plt.xlim([-1, data.data.shape[1]])
plt.show()

Bar graph of feature importances

I apologize. I didn't catch the part where you mention what kind of statement you're trying to make. I'm assuming your response variable is either 0 or 1. You could try something like this:

  1. Fit a linear regression model to the data. It won't give you the most accurate fit, but it will be robust enough to get the information you're looking for.
  2. Find the response of the model on the original inputs (the predictions most likely won't be exactly 0 or 1).
  3. Artificially change the inputs, and find the difference between the outputs on the original data and the modified data, like you suggested in your question.
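A sketch of these three steps with a linear model (on the same kind of synthetic data as before, which is an assumption for illustration). A nice property of the linear fit is that the average change from adding one cat is exactly the cat coefficient, so the perturbation just recovers coef_[0]:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# synthetic counts of cats, dogs, chickens, and a 0/1 response
X = rng.poisson(lam=[2, 1, 3], size=(300, 3)).astype(float)
y = (X[:, 0] + rng.normal(0, 1, 300) > 2).astype(int)

# step 1: fit a linear regression to the 0/1 response
model = LinearRegression().fit(X, y)

# step 2: model response on the original inputs
p_base = model.predict(X)

# step 3: add one cat and compare outputs
X_plus = X.copy()
X_plus[:, 0] += 1
p_plus = model.predict(X_plus)

# for a linear model this equals model.coef_[0] exactly
print("average change:", (p_plus - p_base).mean())
```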

Try it out with a logistic regression as well. Which kind of regression works best really depends on your data and how it is distributed. To translate a change in an input into a change in probability, you do need some form of regression model.
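The same probe with logistic regression (again on made-up data). Unlike the linear fit, here the probability change varies by observation, so we report the average of the per-observation differences:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# synthetic counts of cats, dogs, chickens, and a 0/1 response
X = rng.poisson(lam=[2, 1, 3], size=(300, 3)).astype(float)
y = (X[:, 0] + rng.normal(0, 1, 300) > 2).astype(int)

clf = LogisticRegression().fit(X, y)

# predicted probability of the positive class, before and after "+1 cat"
p_base = clf.predict_proba(X)[:, 1]
X_plus = X.copy()
X_plus[:, 0] += 1
p_plus = clf.predict_proba(X_plus)[:, 1]

# average change in probability per extra cat
print("average marginal effect:", (p_plus - p_base).mean())
```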

You can even try a neural network with a single hidden layer and a regression/linear output layer to do the same thing. Add layers or non-linear activation functions if the data is unlikely to be linear.
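For completeness, here is the same probe with a small one-hidden-layer network. I'm using sklearn's MLPClassifier as a stand-in for the "single layer neural network", and the data is again synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
# synthetic counts of cats, dogs, chickens, and a 0/1 response
X = rng.poisson(lam=[2, 1, 3], size=(300, 3)).astype(float)
y = (X[:, 0] + rng.normal(0, 1, 300) > 2).astype(int)

# one hidden layer with 8 units
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=0).fit(X, y)

# predicted probabilities before and after "+1 cat"
p_base = net.predict_proba(X)[:, 1]
X_plus = X.copy()
X_plus[:, 0] += 1
p_plus = net.predict_proba(X_plus)[:, 1]

print("average change in probability:", (p_plus - p_base).mean())
```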

Cheers!
