Feature Importance extraction of Decision Trees (scikit-learn)
I have been trying to get a grip on the importance of the features used in the decision tree I modelled. I am interested in discovering the weight of each feature selected at the nodes, as well as the terms themselves. My data is a bunch of documents. This is my code for the decision tree; I modified a snippet from scikit-learn ( http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html ):
from sklearn.feature_extraction.text import TfidfVectorizer

### Feature extraction
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords, use_idf=True,
                                   tokenizer=None, ngram_range=(1, 2))
tfidf_matrix = tfidf_vectorizer.fit_transform(data[:, 1])
terms = tfidf_vectorizer.get_features_names()
### Define Decision Tree and fit
from sklearn.tree import DecisionTreeClassifier
dtclf = DecisionTreeClassifier(random_state=1234)
dt = data.copy()
y = dt["label"]
X = tfidf_matrix
fitdt = dtclf.fit(X, y)
from sklearn import tree

### Visualize Decision Tree
with open('data.dot', 'w') as file:
    tree.export_graphviz(dtclf, out_file=file, feature_names=terms)

import subprocess
subprocess.call(['dot', '-Tpdf', 'data.dot', '-o', 'data.pdf'])
### Extract feature importance
import numpy as np
importances = dtclf.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print('Feature Ranking:')
for f in range(tfidf_matrix.shape[1]):
    if importances[indices[f]] > 0:
        print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
        print("feature name: ", terms[indices[f]])
fitdt = dtclf.fit(X, y)
with open(...):
    tree.export_graphviz(dtclf, out_file=file, feature_names=terms)
Thanks in advance
For your first question, you need to use terms = tfidf_vectorizer.get_feature_names() to get the feature names from the vectorizer. For your second question, you can call export_graphviz with feature_names = terms to get the actual names of your variables to appear in the visualisation (check out the full documentation of export_graphviz for many other options that may be useful for improving your visualisation).