Most important features for SVM text classification using Python and scikit-learn

Question

I am new to AI. I am working with the SVM algorithm and running this Python script to train a model and predict whether an email is spam. The script works:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

# dependencies
# pip install pandas
# pip install -U scikit-learn

spam = pd.read_csv('Cartel1.csv')
z = spam['v2']
y = spam["v1"]

# Split the data into training and test sets.
z_train, z_test, y_train, y_test = train_test_split(z, y, test_size=0.2)
# Convert the text into token-count vectors using CountVectorizer.
cv = CountVectorizer()
features = cv.fit_transform(z_train)
# Named clf so that the imported svm module is not shadowed.
clf = svm.SVC()
clf.fit(features, y_train)
features_test = cv.transform(z_test)

comment = ["Sexy free Call and text messages on 08002986030"]
vect = cv.transform(comment)
print("This comment: ", comment, " is: ", clf.predict(vect))  # spam

comment2 = ["Hi there, I am emailing you today to let you know we have created a new task for you."]
vect2 = cv.transform(comment2)
print("This comment: ", comment2, " is: ", clf.predict(vect2))  # ham (not spam)
# print(clf.score(features_test, y_test))

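But I would like to be able to inspect the model to get the most common words classified as "spam" and "ham". I would like a result similar to this: Determining the most contribution features for SVM classifier in sklearn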

I want to get the most common words classified as spam or ham.

Answer 1

The linked question plus your example here gets you very close.

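That said, let's rewrite it as a minimal reproducible example. An SVM assigns importance to features according to how well they separate the positive and negative classes, so it is more helpful if we also show some features that are ambiguous between spam and not spam: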

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
import numpy as np

raw_data = [
    "hello sexy free call and text messages on 08002986030",
    "hello see my text with a new task",
    "this is your boss please text me",
]
y = np.array([1.0, 0.0, 0.0])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(raw_data)

Note that the word "text" appears in all three examples.
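One way to confirm this is to look at the "text" column of the count matrix directly. This is a small sketch that rebuilds the same three-document corpus; the names text_column and text_counts are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

raw_data = [
    "hello sexy free call and text messages on 08002986030",
    "hello see my text with a new task",
    "this is your boss please text me",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(raw_data)

# vocabulary_ maps each token to its column index in X
text_column = vectorizer.vocabulary_["text"]

# Densify just that column: one count per document
text_counts = X[:, text_column].toarray().ravel()
print(text_counts)
```

A feature with the same count in every document carries no information about the label, which is why it makes a useful "ambiguous" control here.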

We can fit the SVM on the vectorized X and y using a linear kernel, as suggested in the other question:

svm = SVC(kernel="linear")
svm.fit(X, y)
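The linear kernel matters here. The script in the question uses SVC() with its default RBF kernel, and scikit-learn only exposes coef_ for a linear kernel, so the inspection step would fail on that model. A small sketch of the difference on the same toy data (rbf_svm, linear_svm and rbf_has_coef are illustrative names):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

raw_data = [
    "hello sexy free call and text messages on 08002986030",
    "hello see my text with a new task",
    "this is your boss please text me",
]
y = np.array([1.0, 0.0, 0.0])
X = CountVectorizer().fit_transform(raw_data)

rbf_svm = SVC().fit(X, y)                    # default kernel is "rbf"
linear_svm = SVC(kernel="linear").fit(X, y)

# Accessing coef_ on a non-linear kernel raises AttributeError
try:
    rbf_svm.coef_
    rbf_has_coef = True
except AttributeError:
    rbf_has_coef = False

print("rbf exposes coef_:", rbf_has_coef)
print("linear coef_ shape:", linear_svm.coef_.shape)
```

Switching kernels can change the predictions, so it is worth re-checking accuracy on held-out data after the change.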

Visualization involves (1) extracting the SVM coefficients, (2) getting the words, and (3) sorting and plotting:

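import matplotlib.pyplot as plt

# coef_ is sparse here because the model was fit on a sparse matrix
coefs = svm.coef_.toarray().flatten()
words = vectorizer.get_feature_names_out()

# Sort (coefficient, word) pairs from largest to smallest coefficient
coefs, words = zip(*sorted(zip(coefs, words), key=lambda x: x[0], reverse=True))
plt.barh(words, coefs)
plt.show()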

Note that the word "text" has a coefficient of 0.0, which means it does not help distinguish spam from non-spam.

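For larger datasets, the set of coefficients and words may grow to thousands or tens of thousands. In that case, you can apply the same sorting approach and then keep only the top 10 and bottom 10 entries.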

Solution 1
0 2022-12-16 18:17:27
