Cosine similarity and SVC using scikit-learn

I am trying to use a cosine similarity kernel for text classification with an SVM, on a raw dataset of 1000 words:

# Libraries
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

# Data
x_train, x_test, y_train, y_test = train_test_split(raw_data[:, 0], raw_data[:, 1], test_size=0.33, random_state=42)

# CountVectorizer
c = CountVectorizer(max_features=1000, analyzer = "char")
X_train = c.fit_transform(x_train).toarray()
X_test = c.transform(x_test).toarray()

# Kernel
cosine_X_tr = cosine_similarity(X_train)
cosine_X_tst = cosine_similarity(X_test)

# SVM
svm_model = SVC(kernel="precomputed")
svm_model.fit(cosine_X_tr, y_train)
y_pred = svm_model.predict(cosine_X_tst)

But that code throws the following error:

ValueError: X has 330 features, but SVC is expecting 670 features as input

I've tried the following, but I don't know whether it is mathematically correct, and I also want to implement other custom kernels that scikit-learn does not provide, such as histogram intersection:

cosine_X_tst = cosine_similarity(X_test, X_train)

So, basically, the main problem lies in the dimensions of the matrix that SVC receives. Once CountVectorizer is applied to the train and test datasets, both have 1000 features because of the max_features parameter:

  • Train dataset of shape (670, 1000)
  • Test dataset of shape (330, 1000)

But after applying cosine similarity, they are converted to square matrices:

  • Train dataset of shape (670, 670)
  • Test dataset of shape (330, 330)

When SVC is fitted to the train data it learns 670 features, so it cannot predict on the test dataset, which has a different number of features (330). So, how can I solve that problem and use custom kernels with SVC?

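Define the function yourself and pass it to the kernel parameter of SVC(), like: SVC(kernel=your_custom_function). The function receives two arrays of shape (n_samples_1, n_features) and (n_samples_2, n_features) and must return a kernel matrix of shape (n_samples_1, n_samples_2).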
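As an illustration of such a function, here is a minimal sketch of the histogram intersection kernel mentioned in the question. The function name histogram_intersection and the random toy data are assumptions made for this example only. It loops over rows so that memory stays at one (n_samples_2, n_features) block at a time.

```python
import numpy as np
from sklearn.svm import SVC


def histogram_intersection(X, Y):
    """Return K with K[i, j] = sum_k min(X[i, k], Y[j, k])."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    K = np.empty((X.shape[0], Y.shape[0]))
    for i, row in enumerate(X):
        # Compare one row of X against every row of Y at once
        K[i] = np.minimum(row, Y).sum(axis=1)
    return K


# Toy count data standing in for the CountVectorizer output (hypothetical)
rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(20, 6)).astype(float)
y_train = np.array([0, 1] * 10)
X_test = rng.integers(0, 5, size=(7, 6)).astype(float)

# SVC calls the function itself: on (X_train, X_train) during fit,
# and on (X_test, X_train) during predict
model = SVC(kernel=histogram_intersection)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

With a callable kernel, fit and predict both take the raw feature matrices, so the shape mismatch from the question does not come up.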


Also, in your code you should pass cosine_similarity directly as the kernel, like below, so that SVC computes the kernel matrix itself from the raw feature matrices:

svm_model = SVC(kernel=cosine_similarity)
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)
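For the kernel="precomputed" route, predict expects a matrix of shape (n_test_samples, n_train_samples), which is what cosine_similarity(X_test, X_train) from the question produces. The following sketch compares both routes on random toy data. The data and variable names are assumptions for illustration, not the asker's dataset.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import cosine_similarity

# Toy count data standing in for the CountVectorizer output (hypothetical)
rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(30, 8)).astype(float)
y_train = np.array([0, 1] * 15)
X_test = rng.integers(0, 5, size=(10, 8)).astype(float)

# Precomputed route: train-vs-train for fit, test-vs-train for predict
K_train = cosine_similarity(X_train)         # shape (30, 30)
K_test = cosine_similarity(X_test, X_train)  # shape (10, 30)
pre = SVC(kernel="precomputed").fit(K_train, y_train)

# Callable route: SVC builds the same matrices internally
call = SVC(kernel=cosine_similarity).fit(X_train, y_train)

# Both routes should give matching decision values on the test set
scores_pre = pre.decision_function(K_test)
scores_call = call.decision_function(X_test)
same = np.allclose(scores_pre, scores_call)
```

The precomputed route is useful when the kernel matrix is expensive to compute and you want to reuse it across several models. The callable route keeps the code shorter.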
