Cosine similarity and SVC using scikit-learn

I am trying to use a cosine similarity kernel for text classification with an SVM, on a raw dataset of 1000 words:

# Libraries
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

# Data
x_train, x_test, y_train, y_test = train_test_split(raw_data[:, 0], raw_data[:, 1], test_size=0.33, random_state=42)

# CountVectorizer
c = CountVectorizer(max_features=1000, analyzer = "char")
X_train = c.fit_transform(x_train).toarray()
X_test = c.transform(x_test).toarray()

# Kernel
cosine_X_tr = cosine_similarity(X_train)
cosine_X_tst = cosine_similarity(X_test)

# SVM
svm_model = SVC(kernel="precomputed")
svm_model.fit(cosine_X_tr, y_train)
y_pred = svm_model.predict(cosine_X_tst)

But that code throws the following error:

ValueError: X has 330 features, but SVC is expecting 670 features as input

I've tried the following, but I don't know whether it is mathematically correct. I also want to implement other custom kernels that scikit-learn does not provide, such as histogram intersection:

cosine_X_tst = cosine_similarity(X_test, X_train)

So the main problem lies in the dimensions of the matrix that SVC receives. Once CountVectorizer is applied to the train and test datasets, both have 1000 features because of the max_features parameter:

  • Train dataset of shape (670, 1000)
  • Test dataset of shape (330, 1000)

But after applying cosine similarity, they are converted to square matrices:

  • Train dataset of shape (670, 670)
  • Test dataset of shape (330, 330)

When SVC is fitted on the training data it learns 670 features, so it cannot predict on the test dataset, which has a different number of features (330). So, how can I solve this problem and use custom kernels with SVC?

Define a function yourself and pass it to the kernel parameter of SVC(), like: SVC(kernel=your_custom_function). See this.
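As a rough sketch of what such a function could look like, here is a histogram intersection kernel passed to SVC as a callable. The name histogram_intersection and the toy data are made up for illustration; the only requirement is that the callable takes two arrays of shapes (n_X, n_features) and (n_Y, n_features) and returns a matrix of shape (n_X, n_Y).

```python
import numpy as np
from sklearn.svm import SVC


def histogram_intersection(X, Y):
    """Gram matrix with K[i, j] = sum_k min(X[i, k], Y[j, k])."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Broadcast to (n_X, n_Y, n_features), then sum the element-wise minima.
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)


# Toy count data standing in for the CountVectorizer output (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(40, 6))
y_demo = np.arange(40) % 2  # alternating labels, only to get two classes

clf = SVC(kernel=histogram_intersection)
clf.fit(X_demo[:30], y_demo[:30])
pred = clf.predict(X_demo[30:])  # SVC calls the kernel on (test, train) itself
```

The broadcasting version builds an (n_X, n_Y, n_features) intermediate array, so for data of the size in the question (670 x 670 x 1000) it may need several gigabytes; looping over rows or chunks would be a more memory-friendly variant.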


Also, you should use the cosine_similarity kernel in your code like this:

svm_model = SVC(kernel=cosine_similarity)
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)
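If you prefer to keep kernel="precomputed", the matrix passed to predict has to hold the kernel between test and training samples, i.e. shape (n_test, n_train), which is what cosine_similarity(X_test, X_train) from the question produces. A small sketch on synthetic data (the arrays X_tr, X_te and the labels are invented stand-ins, not the original dataset) comparing both routes:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import cosine_similarity

# Synthetic two-class data standing in for the vectorized text (hypothetical).
rng = np.random.default_rng(0)
y = np.arange(100) % 2
X = rng.random((100, 10))
X[y == 0, 0] += 3.0  # class 0 points mostly along feature 0
X[y == 1, 1] += 3.0  # class 1 points mostly along feature 1
X_tr, X_te = X[:67], X[67:]
y_tr = y[:67]

# Precomputed route: train kernel is (n_train, n_train),
# test kernel is (n_test, n_train).
K_train = cosine_similarity(X_tr)
K_test = cosine_similarity(X_te, X_tr)
pre = SVC(kernel="precomputed").fit(K_train, y_tr)
pred_pre = pre.predict(K_test)

# Callable route: SVC evaluates the same kernel internally.
call = SVC(kernel=cosine_similarity).fit(X_tr, y_tr)
pred_call = call.predict(X_te)

print(K_train.shape, K_test.shape)
print("same predictions:", np.array_equal(pred_pre, pred_call))
```

Both routes evaluate the same kernel on the same pairs of samples, so they should give the same predictions; the callable form just saves you from building the test-versus-train matrix by hand.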
