
How to determine the most important/informative features using a linear Support Vector Machine (SVM) classifier

I am new to Python and working on a text classification problem. I would like to visualize the most important features of each class in a linear SVM classifier model, that is, to determine which features push the classification decision toward Class-1 or Class-2. This is my code.

import nltk
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv('projectdatacor.csv')
df = df[pd.notnull(df['types'])]
my_types = ['Requirement','Non-Requirement']

# converting to lower case
df['description'] = df.description.map(lambda x: x.lower())

# removing the punctuation (regex=True is needed in recent pandas versions)
df['description'] = df.description.str.replace(r'[^\w\s]', '', regex=True)

# splitting the text into tokens
df['description'] = df['description'].apply(nltk.tokenize.word_tokenize)

# converting the list of words back into space-separated strings
df['description'] = df['description'].apply(lambda x: ' '.join(x))
count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['description'])

# tf-idf
transformer = TfidfTransformer().fit(counts)
counts = transformer.transform(counts)

# splitting the data (counts is a sparse matrix)
X_train, X_test, y_train, y_test = train_test_split(counts, df['types'], test_size=0.3, random_state=39)

# SVC classification (gamma has no effect with a linear kernel)
svclassifier = svm.SVC(gamma=0.001, C=100., kernel='linear')

svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)

# evaluating the model
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('accuracy %s' % accuracy_score(y_pred, y_test))
# note: target_names are matched to the sorted label order, so my_types should follow it
print(classification_report(y_test, y_pred, target_names=my_types))

I have read the related questions available on this platform and found the following useful code, which I added to my own.

import numpy as np
def show_most_informative_features(vectorizer, clf, n=20):
    # scikit-learn >= 1.0; older versions use get_feature_names()
    feature_names = vectorizer.get_feature_names_out()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
show_most_informative_features(count_vect, svclassifier, 20)

This code works for naive Bayes and logistic regression and gives the most important features, but for SVM it raises an error.

I am getting this error.

  File "C:\Users\fhassan\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 704, in runfile
    execfile(filename, namespace)

  File "C:\Users\fhassan\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "U:/FAHAD UL HASSAN/Python Code/happycsv.py", line 209, in <module>
    show_most_informative_features(count_vect, svclassifier, 20)

  File "U:/FAHAD UL HASSAN/Python Code/happycsv.py", line 208, in show_most_informative_features
    print ("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

TypeError: must be real number, not csr_matrix

Any help would be highly appreciated.

Maybe this will help you:

from sklearn import svm
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# convert the sparse matrix to a dense array so that coef_ is a dense array too
x = X.toarray()
y = [0, 0, 0, 1]

model = svm.SVC(kernel='linear')

model.fit(x, y)
model.score(x, y)

# scikit-learn >= 1.0; older versions use get_feature_names()
feature_names = vectorizer.get_feature_names_out()
# pair each coefficient with its word
coefs_with_fns = sorted(zip(model.coef_[0], feature_names))
df = pd.DataFrame(coefs_with_fns)
df.columns = 'coefficient', 'word'
print(df.sort_values(by='coefficient'))

You will get:

[Output: table of words with their coefficients, sorted by coefficient]
