
GridSearchCV performs worse than vanilla SVM using the SAME parameters

Hi community,

I was coding some ML to classify some data into groups.

I tried different methods, but when I got to SVM I encountered this problem.

I have a simple dataset (3 classes, 6 features), and when I use an SVM with fixed parameters (C=10, gamma=1) and predict on the same data, I get 100% accuracy (the model may well be overfitted, but that's another issue).

What I find difficult to understand is that when I then try GridSearchCV (sklearn.model_selection.GridSearchCV) and sweep over all powers of 10 from 10^-5 to 10^5 for C and gamma (which of course includes C=10 and gamma=1), it finds as best_params_: C=10^-5 and gamma=10^-5. With those parameters the accuracy is 41% and all of the predictions fall into one category.

At the very least it should find the same parameters as the FIXED SVM. What is also puzzling is that the same code worked before on other datasets...

My issue NOW is NOT (so please leave these discussions aside if you answer):

  1. overfitting, or the use of the same data for training and testing;

  2. an unbalanced set of data;

  3. dataset issues.

My issue is only WHY GridSearchCV behaves differently from the normal SVM. I am sure it must be something I am coding wrong, or else they really don't work as expected.

Here is the code:

import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing

# Load the dataset from Excel:
xl = pd.ExcelFile('3classes_6_features.xlsx')
cont = xl.parse()

# Encode labels:
labeling = preprocessing.LabelEncoder()
labeling.fit(cont[cont.keys()[0]])

# cont.keys()[0] == "GS" (the label column)
y_all = labeling.transform(np.array(cont["GS"]))
X_all = np.array(cont.drop(columns=["GS"]))  # the positional drop(["GS"], 1) form is deprecated

# NORMAL SVM:
SVMclassifier = svm.SVC(kernel='rbf', gamma=1, C=10, random_state=0)
SVMclassifier.fit(X_all, y_all)

# SVM with HYPERPARAMETER TUNING:
# Sweep C and gamma over 10^-5 .. 10^5, plus the half-decade values in between:
log_sweep = [10**i / j for i in range(-5, 6) for j in [2, 1]]
SVMparam_grid = {'C': log_sweep, 'gamma': log_sweep}

SVMgrid_classifier = GridSearchCV(svm.SVC(kernel='rbf', random_state=0), SVMparam_grid)
SVMgrid_classifier = SVMgrid_classifier.fit(X_all, y_all)


print("INITIAL CLASSES: ", y_all)
print("NORMAL SVM prediction: ", SVMclassifier.predict(X_all))
print("TUNED SVM prediction: ", SVMgrid_classifier.predict(X_all))

The result is:

INITIAL CLASSES: [0 1 2 2 0 0 2 0 1 0 0 1 2 0 0 1 1 1 2 1 2]

NORMAL SVM prediction: [0 1 2 2 0 0 2 0 1 0 0 1 2 0 0 1 1 1 2 1 2]

TUNED SVM prediction: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

I attach the data in case you want to try it yourselves:

https://drive.google.com/open?id=1LPRiMFNeSXu790lG_-NP3HAkvq8vyLcV

Thanks in advance!

Regards,

Luke

The problem is that when you fit and predict with the 'normal SVM' you use all the data to train and then predict on that same data.

When you use GridSearchCV, it performs K-fold cross-validation by default (see the cv parameter). That means it splits the data into training and validation folds, so each candidate model never trains on the whole dataset and is scored (validated) on data it did not train on; the best parameters are the ones with the highest average validation score. That is a very different measure from the 100% training-set accuracy of your fixed model.
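To see how your fixed parameters fare under this scheme, you can score them with cross-validation directly. A minimal sketch, assuming the X_all and y_all arrays from your code, and using cv=5 (GridSearchCV's default in scikit-learn >= 0.22):

from sklearn.model_selection import cross_val_score
from sklearn import svm

# Evaluate C=10, gamma=1 the way GridSearchCV does:
# accuracy on held-out folds, never on the training fold itself.
cv_scores = cross_val_score(svm.SVC(kernel='rbf', gamma=1, C=10), X_all, y_all, cv=5)
print("Per-fold validation accuracy:", cv_scores)
print("Mean validation accuracy:", cv_scores.mean())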

It then takes the parameter combination with the best mean validation score and, since refit=True by default, retrains a final model with those parameters on the whole dataset; that refitted model is the one predict() uses.
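You can inspect what the search actually chose and how every candidate scored. A minimal sketch, again assuming the fitted SVMgrid_classifier from your code:

# The winning parameters and their mean cross-validated accuracy:
print("Best params:", SVMgrid_classifier.best_params_)
print("Best CV accuracy:", SVMgrid_classifier.best_score_)

# The model predict() uses: refitted on ALL the data with the best params.
print("Refitted model:", SVMgrid_classifier.best_estimator_)

# Mean validation score of every (C, gamma) candidate, e.g. to check how
# C=10, gamma=1 fared in the search:
for params, score in zip(SVMgrid_classifier.cv_results_['params'],
                         SVMgrid_classifier.cv_results_['mean_test_score']):
    print(params, score)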
