How to get the most informative features for scikit-learn classifiers?

Question

Classifiers in machine learning packages such as liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features:

viagra = None          ok : spam     =      4.5 : 1.0
hello = True           ok : spam     =      4.5 : 1.0
hello = None           spam : ok     =      3.3 : 1.0
viagra = True          spam : ok     =      3.3 : 1.0
casino = True          spam : ok     =      2.0 : 1.0
casino = None          ok : spam     =      1.5 : 1.0

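My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but couldn't find anything like it.

If there is no such function yet, does anybody know a workaround to get those values?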

Answer 1

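The classifiers themselves do not record feature names; they only see numeric arrays. However, if you extracted your features with a Vectorizer / CountVectorizer / TfidfVectorizer / DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a mistake or two):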

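import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    # scikit-learn 1.0 and later; older releases use get_feature_names()
    feature_names = vectorizer.get_feature_names_out()
    for i, class_label in enumerate(class_labels):
        # argsort is ascending, so the last 10 indices hold the largest coefficients
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))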

This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.

Answer 2

With the help of larsmans' code I came up with this code for the binary case:

def show_most_informative_features(vectorizer, clf, n=20):
    # scikit-learn 1.0 and later; older releases use get_feature_names()
    feature_names = vectorizer.get_feature_names_out()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    # Pair the n most negative coefficients with the n most positive ones
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

Answer 3

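To add an update, RandomForestClassifier now supports the .feature_importances_ attribute. This attribute tells you how much each feature contributes to the fitted model; scikit-learn computes it as the normalized total reduction of the split criterion brought by that feature, rather than literally as explained variance. Because the values are normalized, they sum to 1.

I find this attribute very useful when performing feature engineering.

Thanks to the scikit-learn team and contributors for implementing this!

Edit: this works for both RandomForest and GradientBoosting. So RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier and GradientBoostingRegressor all support it.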
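A quick sketch of reading the attribute from both kinds of ensemble. The synthetic dataset and the feature names f0..f5 are invented for the example; with shuffle=False, make_classification puts the informative columns first.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic data: the first 3 columns are informative, the last 3 are noise.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
feature_names = ["f%d" % i for i in range(X.shape[1])]  # made-up names

for model in (RandomForestClassifier(n_estimators=50, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    model.fit(X, y)
    importances = model.feature_importances_
    # Pair each score with its name and list the highest scores first.
    ranked = sorted(zip(importances, feature_names), reverse=True)
    print(type(model).__name__, "sum = %.3f" % importances.sum())
    for score, name in ranked:
        print("  %-3s %.3f" % (name, score))
```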

Answer 4

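We've recently released a library (https://github.com/TeamHG-Memex/eli5) which allows you to do that: it handles various classifiers from scikit-learn, covers binary and multiclass cases, can highlight text according to feature values, integrates with IPython, etc.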

Answer 5

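I actually had to find feature importances for my NaiveBayes classifier, and although I used the functions above, I could not get feature importances per class. I went through the scikit-learn documentation and tweaked the functions above a bit, and found that this works for my problem. Hope it helps you too!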

def important_features(vectorizer, classifier, n=20):
    class_labels = classifier.classes_
    # scikit-learn 1.0 and later; older releases use get_feature_names()
    feature_names = vectorizer.get_feature_names_out()

    # feature_count_[k] holds the per-feature counts seen for class k during fit
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names), reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names), reverse=True)[:n]

    print("Important words in negative reviews")

    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)

    print("-----------------------------------------")
    print("Important words in positive reviews")

    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat)

Note that your classifier (in my case NaiveBayes) must have the feature_count_ attribute for this to work.

Answer 6

You can also do something like this to create a graph of the feature importances in order:

import numpy as np
import matplotlib.pyplot as plt

# Assumes clf is a fitted forest and train[features] is the training DataFrame
importances = clf.feature_importances_
# Spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
# Feature indices ordered from most to least important
indices = np.argsort(importances)[::-1]

# Print the feature ranking
#print("Feature ranking:")

# Plot the feature importances of the forest
n_features = train[features].shape[1]
plt.figure()
plt.title("Feature importances")
plt.bar(range(n_features), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(n_features), indices)
plt.xlim([-1, n_features])
plt.show()
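The same three arrays can also be shown as a plain text ranking when a plot is not convenient. A minimal self-contained sketch, assuming a synthetic dataset in place of train[features]:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data used above.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

importances = clf.feature_importances_
# Standard deviation of each feature's importance across the trees
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]  # most important first

print("Feature ranking:")
for rank, idx in enumerate(indices, start=1):
    print("%d. feature %d (%.3f +/- %.3f)"
          % (rank, idx, importances[idx], std[idx]))
```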

Answer 7

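RandomForestClassifier does not yet have a coef_ attribute, but I think it will in the 0.17 release. However, see the RandomForestClassifierWithCoef class in "Recursive feature elimination on Random Forest using scikit-learn". It may give you some ideas for working around the limitation above.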

Answer 8

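Not exactly what you are looking for, but a quick way to get the largest-magnitude coefficients (assuming the pandas dataframe columns are your feature names):

You trained the model like this:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

lr = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(df, Y, test_size=0.25)
lr.fit(X_train, y_train)

Get the 10 largest negative coefficient values (or change to reverse=True for the largest positive ones) like this:

# X_train.columns are the feature names the model was fitted on
sorted(list(zip(X_train.columns, lr.coef_)), key=lambda x: x[1],
       reverse=False)[:10]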
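If what you want is magnitude regardless of sign, one option is to sort on the absolute value instead. A sketch under made-up assumptions: the columns a to d and the target are synthetic, chosen so that "b" has the largest effect and that effect is negative. Keep in mind that coefficient sizes are only comparable when the features are on similar scales.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
# Made-up target: "b" has the largest effect, and it is negative.
Y = 1.0 * df["a"] - 5.0 * df["b"] + 0.1 * df["c"] + rng.normal(scale=0.01, size=200)

lr = LinearRegression().fit(df, Y)

# Label each coefficient with its column name, then order by absolute value.
coefs = pd.Series(lr.coef_, index=df.columns)
by_magnitude = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(by_magnitude.head(10))
```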

Answer 9

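First make a list; I named mine label. Then extract all the feature names (the column names) and add them to the label list. Here I use a naive Bayes model, where feature_log_prob_ gives the log probability of each feature for each class.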

def top20(model, label):
    # feature_log_prob_[k] holds log P(feature | class k); every value is <= 0,
    # so the most probable features are the ones closest to zero. Sorting on
    # abs() in descending order would list the least probable features instead.
    feature_log_prob = model.feature_log_prob_
    for class_index in range(len(feature_log_prob)):
        print('top 20 features for {} class'.format(class_index))
        ranked = sorted(enumerate(feature_log_prob[class_index]),
                        key=lambda x: x[1], reverse=True)[:20]
        for feature_index, _ in ranked:
            print(label[feature_index])
        print('*' * 80)

Answer 1  66  2012-06-20 09:51:55
Answer 2  54  accepted  2012-06-21 14:55:49
Answer 3  16  2016-08-13 07:31:42
Answer 4  13  2016-11-24 17:42:54
Answer 5  4  2018-06-12 06:42:28
Answer 6  1  2016-08-01 14:55:15
Answer 7  0  2015-07-28 18:35:13
Answer 8  0  2019-02-28 22:19:40
Answer 9  0  2020-01-09 18:33:24
