Feature Importance extraction of Decision Trees (scikit-learn)

Question

I've been trying to get a grip on the importance of features used in a decision tree I've modelled. I'm interested in discovering the weight of each feature selected at the nodes as well as the term itself. My data is a bunch of documents. This is my code for the decision tree; I modified the scikit-learn code snippet that extracts feature importances ( http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html ):

import subprocess

import numpy as np
from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

### Feature extraction
# `data` and `stopwords` are assumed to be defined earlier (not shown)
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords, use_idf=True,
                                   tokenizer=None, ngram_range=(1, 2))

tfidf_matrix = tfidf_vectorizer.fit_transform(data[:, 1])
# get_feature_names() was renamed get_feature_names_out() in scikit-learn 1.0
terms = tfidf_vectorizer.get_feature_names()

### Define Decision Tree and fit
dtclf = DecisionTreeClassifier(random_state=1234)

dt = data.copy()

y = dt["label"]
X = tfidf_matrix

fitdt = dtclf.fit(X, y)

### Visualize Decision Tree
# the with block closes the file, so no explicit close() is needed
with open('data.dot', 'w') as file:
    tree.export_graphviz(dtclf, out_file=file, feature_names=terms)

subprocess.call(['dot', '-Tpdf', 'data.dot', '-o', 'data.pdf'])

### Extract feature importance
importances = dtclf.feature_importances_

# feature indices sorted by decreasing importance
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print('Feature Ranking:')

for f in range(tfidf_matrix.shape[1]):
    if importances[indices[f]] > 0:
        print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
        print("feature name: ", terms[indices[f]])

1. Am I correct in assuming that terms[indices[f]] (where terms is the vector of feature terms) will print the actual feature term used to split the tree at a certain node?
2. The decision tree visualised with GraphViz shows, for instance, X[30]. I'm assuming this refers to the numerical index of the feature term. How do I extract the term itself so I can validate the process I deployed in #1?

Updated code

fitdt = dtclf.fit(X, y)
with open(...):
    tree.export_graphviz(dtclf, out_file = file, feature_names = terms)

Thanks in advance

Answer 1

For your first question, you need to get the feature names out of the vectoriser with terms = tfidf_vectorizer.get_feature_names() (in scikit-learn 1.0 and later the method is called get_feature_names_out()). For your second question, you can call export_graphviz with feature_names = terms to get the actual names of your variables to appear in your visualisation (check out the full documentation of export_graphviz for many other options that may be useful for improving your visualisation).
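
To make both points concrete, here is a minimal sketch on a made-up toy corpus. The documents, labels, and variable names are assumptions for illustration only, not the asker's data. It maps the nonzero importances back to their terms, reads the split features straight off the fitted tree via tree_.feature, and passes the same term list to export_graphviz so the node labels show terms instead of X[i].

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical toy corpus; stands in for the real documents and labels.
docs = [
    "the cat chased the mouse",
    "a cat slept on the sofa",
    "my cat likes fish",
    "stock prices fell today",
    "the stock market rallied",
    "investors bought stock",
]
labels = ["pets", "pets", "pets", "finance", "finance", "finance"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# scikit-learn >= 1.0 provides get_feature_names_out(); older releases
# only have get_feature_names(). Column i of X corresponds to terms[i].
if hasattr(vectorizer, "get_feature_names_out"):
    terms = list(vectorizer.get_feature_names_out())
else:
    terms = list(vectorizer.get_feature_names())

clf = DecisionTreeClassifier(random_state=1234).fit(X, labels)

# Rank features by importance and keep only those with nonzero weight.
importances = clf.feature_importances_
order = np.argsort(importances)[::-1]
ranked = [(terms[i], importances[i]) for i in order if importances[i] > 0]

# tree_.feature stores, per node, the column index used for the split;
# leaf nodes carry a negative placeholder, so they are filtered out.
split_terms = [terms[i] for i in clf.tree_.feature if i >= 0]

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_source = export_graphviz(clf, out_file=None, feature_names=terms)

for rank, (term, weight) in enumerate(ranked, start=1):
    print("%d. %s (%f)" % (rank, term, weight))
print("terms used in splits:", split_terms)
```

Every term with a nonzero importance should also show up in split_terms, which is a quick way to check that terms[indices[f]] really names the feature used at a node.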


solution1
0 2015-12-13 04:43:10