
Random Forest Feature Importance Chart using Python

I am working with RandomForestRegressor in Python and I want to create a chart that illustrates the ranking of feature importances. This is the code I used:

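import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error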

MT = pd.read_csv("MT_reduced.csv")
df = MT.reset_index(drop = False)

columns2 = df.columns.tolist()

# Filter the columns to remove ones we don't want.
columns2 = [c for c in columns2 if c not in ["Violent_crime_rate", "Change_Property_crime_rate", "State", "Year"]]

# Store the variable we'll be predicting on.
target = "Property_crime_rate"

# Let’s randomly split our data with 80% as the train set and 20% as the test set:

# Generate the training set.  Set random_state to be able to replicate results.
train2 = df.sample(frac=0.8, random_state=1)

# Exclude all observations whose index is in the training set.
test2 = df.loc[~df.index.isin(train2.index)]

# Both sets should have the same number of columns; only the number of observations should differ.
print(train2.shape)
print(test2.shape)

# Initialize the model with some parameters.

model = RandomForestRegressor(n_estimators=100, min_samples_leaf=8, random_state=1)

# n_estimators = number of trees in the forest
# min_samples_leaf = minimum number of samples at each leaf


# Fit the model to the data.
model.fit(train2[columns2], train2[target])
# Make predictions.
predictions_rf = model.predict(test2[columns2])
# Compute the error.
mean_squared_error(test2[target], predictions_rf)  # 650.4928

Feature Importance

features=df.columns[[3,4,6,8,9,10]]
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')

This feature importance code was adapted from an example found at http://www.agcross.com/2015/02/random-forests-in-python-with-scikit-learn/

I receive the following error when I attempt to replicate the code with my data:

  IndexError: index 6 is out of bounds for axis 1 with size 6

Also, only one feature shows up on my chart, with 100% importance and no labels.

Any help solving this issue so I can create this chart will be greatly appreciated.

Here is an example using the iris data set.

>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> iris = load_iris()
>>> rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
>>> rnd_clf.fit(iris["data"], iris["target"])
>>> for name, importance in zip(iris["feature_names"], rnd_clf.feature_importances_):
...     print(name, "=", importance)

sepal length (cm) = 0.112492250999
sepal width (cm) = 0.0231192882825
petal length (cm) = 0.441030464364
petal width (cm) = 0.423357996355

Plotting feature importance

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> features = iris['feature_names']
>>> importances = rnd_clf.feature_importances_
>>> indices = np.argsort(importances)

>>> plt.title('Feature Importances')
>>> plt.barh(range(len(indices)), importances[indices], color='b', align='center')
>>> plt.yticks(range(len(indices)), [features[i] for i in indices])
>>> plt.xlabel('Relative Importance')
>>> plt.show()

[Image: feature importances]

Load the feature importances into a pandas Series indexed by your column names, then use its plot method. For example, for an sklearn RF classifier/regressor model trained using df (where df holds only the feature columns):

feat_importances = pd.Series(model.feature_importances_, index=df.columns)
feat_importances.nlargest(4).plot(kind='barh')
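As a minimal, self-contained sketch of the Series approach: the data below is synthetic (make_regression) and the column names feature_0 ... feature_5 are made up for illustration. The point is that the Series index comes from the exact columns passed to fit(), so labels and importances cannot get out of step.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; the column names are hypothetical.
X_arr, y = make_regression(n_samples=200, n_features=6, n_informative=3, random_state=1)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

model = RandomForestRegressor(n_estimators=100, min_samples_leaf=8, random_state=1)
model.fit(X, y)

# Index the importances by the same columns that were passed to fit().
feat_importances = pd.Series(model.feature_importances_, index=X.columns)

# Plot the four largest importances, with the largest bar at the top.
ax = feat_importances.nlargest(4).sort_values().plot(kind="barh")
ax.set_xlabel("Relative Importance")
ax.set_title("Feature Importances")
```

Call plt.show() (or ax.figure.savefig(...)) afterwards to display or save the chart.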


A bar plot is a useful way to visualize the importance of the features.

Use this (example using the Iris dataset):

from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create random forest classifier object
clf = RandomForestClassifier(random_state=0, n_jobs=-1)
# Train model
model = clf.fit(X, y)

# Calculate feature importances
importances = model.feature_importances_
# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]

# Rearrange feature names so they match the sorted feature importances
names = [iris.feature_names[i] for i in indices]

# Barplot: Add bars
plt.bar(range(X.shape[1]), importances[indices])
# Add feature names as x-axis labels
plt.xticks(range(X.shape[1]), names, rotation=20, fontsize = 8)
# Create plot title
plt.title("Feature Importance")
# Show plot
plt.show()


The method you are trying to apply uses the built-in feature importance of Random Forest. This method can sometimes prefer numerical features over categorical ones, and can prefer high-cardinality categorical features. Please see this article for details. There are two other methods to get feature importance (each with its own pros and cons).
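One rough way to see this bias for yourself is to append a column of pure noise and check how much built-in importance it receives. The sketch below assumes the iris data and a hypothetical random_noise column (a unique value on every row, so very high cardinality). The exact numbers depend on the seed and the scikit-learn version, so none are quoted here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Hypothetical extra column: pure noise, unrelated to the target.
rng = np.random.default_rng(0)
X["random_noise"] = rng.normal(size=len(X))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, iris.target)

# Built-in (impurity-based) importances, labelled by the training columns.
importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

If the noise column gets a non-trivial share here, that share reflects how the trees were built on the training data, not any predictive value.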

Permutation-based Feature Importance

Since version 0.22, scikit-learn provides the permutation_importance function. It is model agnostic. It can even work with algorithms from other packages if they follow the scikit-learn interface. The complete code example:

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from matplotlib import pyplot as plt

# prepare the data
# (load_boston was removed in scikit-learn 1.2; load_diabetes stands in here
# as another built-in regression dataset)
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)

# train the model
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

# the permutation based importance
perm_importance = permutation_importance(rf, X_test, y_test)

sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(X.columns[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
plt.show()

[Image: permutation-based variable importance for the Random Forest]

The permutation-based importance can be computationally expensive, and it can understate the importance of highly correlated features.
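Because each permutation is random, it can help to repeat the shuffling several times and look at the spread as well as the mean. The following is a sketch under stated assumptions: it uses the built-in diabetes dataset as a stand-in, and the values n_repeats=10 and random_state=12 are arbitrary choices. With no scoring argument, the estimator's own score method is used, which is R^2 for a regressor.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in regression data, loaded as a DataFrame so column names are kept.
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)

rf = RandomForestRegressor(n_estimators=100, random_state=12)
rf.fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out set and record the drop in score.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=12)

# Horizontal bars sorted by mean importance, with one standard deviation as error bars.
sorted_idx = result.importances_mean.argsort()
fig, ax = plt.subplots()
ax.barh(
    X.columns[sorted_idx],
    result.importances_mean[sorted_idx],
    xerr=result.importances_std[sorted_idx],
)
ax.set_xlabel("Permutation Importance (mean decrease in R^2)")
```

Raising n_repeats gives a steadier estimate, and the cost grows with it.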

SHAP-based importance

Feature importance can be computed with Shapley values (you need the shap package).

import shap
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

[Image: Random Forest feature importance from SHAP]

Once SHAP values are computed, other plots can be made:

[Image: SHAP summary plot for the Random Forest]

Computing SHAP values can be computationally expensive. The full example of 3 methods to compute Random Forest feature importance can be found in this blog post of mine.

The y-ticks are not correct. To fix them, use:

plt.yticks(range(len(indices)), [features[i] for i in indices])
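For a model trained on DataFrame columns, the same fix can be sketched as follows. The DataFrame and its column names (a, b, c, d, target) are made up for illustration. The key assumption is that the tick labels are taken from the very list of columns passed to fit(), not from hard-coded positions in df.columns.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data: the target depends on "a" and "c" only.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
df["target"] = 3 * df["a"] - 2 * df["c"] + rng.normal(scale=0.1, size=200)

feature_cols = [c for c in df.columns if c != "target"]
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=8, random_state=1)
model.fit(df[feature_cols], df["target"])

importances = model.feature_importances_
indices = np.argsort(importances)

# Labels are looked up in the list used for fit(), in sorted order.
labels = [feature_cols[i] for i in indices]

fig, ax = plt.subplots()
ax.barh(range(len(indices)), importances[indices], color="b", align="center")
ax.set_yticks(range(len(indices)))
ax.set_yticklabels(labels)
ax.set_xlabel("Relative Importance")
ax.set_title("Feature Importances")
```

Since feature_cols and model.feature_importances_ always have the same length and order, an out-of-bounds index cannot occur when building the labels.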

This code from spies006 does not work: plt.yticks(range(len(indices)), features[indices]), so you have to change it to plt.yticks(range(len(indices)), features.columns[indices])

In the above code from spies006, "feature_names" didn't work for me. A generic solution would be to use name_of_the_dataframe.columns.

from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create random forest classifier object
clf = RandomForestClassifier(random_state=0, n_jobs=-1)
# Train model
model = clf.fit(X, y)

feat_importances = pd.DataFrame(model.feature_importances_, index=iris.feature_names, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
feat_importances.plot(kind='bar', figsize=(8,6))


print(feat_importances)

and we get:

                   Importance
petal width (cm)     0.489820
petal length (cm)    0.368047
sepal length (cm)    0.118965
sepal width (cm)     0.023167
