简体   繁体   English

Scikit learn - Classification

Is there a straightforward way to view the top features of each class, based on tfidf?

I am using the KNeighbors classifier, SVC-Linear, and MultinomialNB.

Secondly, I have been searching for a way to view documents that have not been classified correctly. I can view the confusion matrix, but I would like to see the specific documents to understand which features are causing the misclassification.

from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
classifier = SVC(kernel='linear')
# fit the vectorizer on the training text only
counts = tfidf_vectorizer.fit_transform(data['text'].values).toarray()
targets = data['class'].values
classifier.fit(counts, targets)
# reuse the fitted vocabulary on the test set (transform, not fit_transform)
counts = tfidf_vectorizer.transform(test['text'].values).toarray()
predictions = classifier.predict(counts)

EDIT: I have added the code snippet where I am only creating a tfidf vectorizer and using it to train the classifier.

Like the previous comments suggest, a more specific question would result in a better answer, but I use this package all the time so I will try and help.

I. Determining the top features for classification classes in sklearn really depends on the individual tool you are using. For example, many ensemble methods (like RandomForestClassifier and GradientBoostingClassifier) come with a .feature_importances_ attribute which scores each feature based on its importance. In contrast, most linear models (like LogisticRegression or RidgeClassifier) have a regularization penalty which penalizes the size of the coefficients, meaning that the coefficient sizes are somewhat a reflection of feature importance (although you need to keep in mind the numeric scales of the individual features); these can be accessed through the .coef_ attribute of the model class.
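For instance, with the linear SVC from your question, a minimal sketch for printing the highest-weighted terms might look like the following (assuming the classifier and tfidf_vectorizer fitted in your snippet above, and sklearn >= 1.0 for get_feature_names_out):

import numpy as np

# column j of the tfidf matrix corresponds to feature_names[j]
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())

# for a binary problem coef_ has a single row; for multiclass SVC
# (one-vs-one) there is one row per pair of classes
for i, row in enumerate(classifier.coef_):
    weights = np.asarray(row).ravel()
    top = np.argsort(weights)[-10:][::-1]  # indices of the 10 largest weights
    print(f"row {i}: {feature_names[top]}")

For MultinomialNB the analogous attribute is feature_log_prob_, with one row per class.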

In summary, almost all sklearn models have some method to extract the feature importances, but the methods differ from model to model. Luckily the sklearn documentation is FANTASTIC, so I would read up on your specific model to determine your best approach. Also, make sure to read the User Guide associated with your problem type in addition to the model-specific API.
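As a sketch of the ensemble case (a hypothetical fit reusing counts, targets, and feature_names from above, not part of your original setup), ranking features by .feature_importances_ looks like:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier().fit(counts, targets)
order = np.argsort(forest.feature_importances_)[::-1]  # most important first
print(feature_names[order[:10]])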

II. There is no out-of-the-box sklearn method to provide the mis-classified records, but if you are using a pandas DataFrame (which you should) to feed the model, it can be accomplished in a few lines of code like this.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier  # note: ensemble, not linear_model

df = pd.DataFrame(data)
x = df[[<list of feature columns>]]
y = df[<target column>]

mod = RandomForestClassifier()
mod.fit(x.values, y.values)

# predict back over the same frame so predictions sit alongside the labels
df['predict'] = mod.predict(x.values)

# keep only the rows where the prediction disagrees with the true label
incorrect = df[df['predict'] != df[<target column>]]

The resulting incorrect DataFrame will contain only the records which were misclassified.
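To connect this back to part I, here is a hedged sketch for eyeballing which terms dominate each misclassified document (this assumes the text/tfidf setup from your question, i.e. that the frame has your 'text' and 'class' columns, rather than the generic feature columns above):

import numpy as np

feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
for idx, row in incorrect.iterrows():
    # re-vectorize the single document and pull its heaviest tfidf terms
    vec = tfidf_vectorizer.transform([row['text']]).toarray().ravel()
    top = np.argsort(vec)[-5:][::-1]
    print(idx, 'true:', row['class'], 'pred:', row['predict'], feature_names[top])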

Hope this helps!
