
Support vector machine overfitting my data

I am trying to make predictions on the iris dataset. I have decided to use SVMs for this purpose, but the model gives me an accuracy of 1.0. Is it a case of overfitting, or is it because the model is very good? Here is my code:

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
svm_model = svm.SVC(kernel='linear', C=1, gamma='auto')
svm_model.fit(X_train, y_train)
predictions = svm_model.predict(X_test)
accuracy_score(y_test, predictions)

Here, accuracy_score returns a value of 1.0. Please help me; I am a beginner in machine learning.

The iris dataset is not a particularly difficult one to get good results on. However, you are right not to trust a model reporting 100% classification accuracy. In your example, the problem is that all 30 test points happen to be classified correctly, but that does not mean your model will generalise well to new data instances. Just change the test_size to 0.3 and the result is no longer 100% (it drops to 97.78%), as the sketch below shows.
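For example, here is a minimal variation of the original snippet with the larger held-out set (a sketch, assuming only scikit-learn is installed):

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the data instead of 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
svm_model = svm.SVC(kernel='linear', C=1, gamma='auto')
svm_model.fit(X_train, y_train)
print(accuracy_score(y_test, svm_model.predict(X_test)))  # ~0.978 rather than 1.0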

The most reliable way to check robustness and guard against overfitting is cross-validation. Here is how to do it easily, starting from your example:

from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import cross_val_score

# Load the iris dataset (all four features)
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Same model and parameters as before
svm_model = svm.SVC(kernel='linear', C=1, gamma='auto')

# 10-fold cross-validation; no separate train/test split is needed,
# since cross_val_score handles the splitting internally
scores = cross_val_score(svm_model, X, y, cv=10)

Here cross_val_score iteratively uses different parts of the dataset as test data (cross-validation) while keeping all your previous model parameters. If you inspect scores you will see that the 10 accuracies now range from 87.87% to 100%. To report the final model performance you can, for example, use the mean of the scores, as in the sketch below.
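A minimal way to report it (assuming the scores array from the snippet above):

# Mean and spread of the 10 fold accuracies
print(scores)
print("Mean accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std()))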

Hope this helps and good luck! :)

You can try cross-validation:

Example:

from sklearn.model_selection import LeaveOneOut
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Load the iris data
iris = datasets.load_iris()
X = iris.data
Y = iris.target

# Build the model
svm_model = SVC(kernel='linear', C=1, gamma='auto', random_state=0)

# Create the cross-validation object
loo = LeaveOneOut()

# Calculate the cross-validated (leave-one-out) accuracy scores
scores = cross_val_score(svm_model, X, Y, cv=loo, scoring='accuracy')

print(scores.mean())

Result (the mean accuracy over the 150 folds, since we used leave-one-out):

0.97999999999999998
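With leave-one-out on the 150 iris samples, each fold holds out a single point, so there are exactly 150 scores and every entry is either 0.0 or 1.0. A quick sketch to confirm (assuming the scores array from above):

import numpy as np

print(len(scores))        # 150 folds, one per sample
print(np.unique(scores))  # each single-sample fold scores 0.0 or 1.0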

Bottom line:

Cross-validation (especially LeaveOneOut) is a good way to detect overfitting and to get a robust estimate of model performance.
