I have a multiclass classification problem. My dataset (call the data X and the labels y) represents sets of points on 640x480 images, so all elements of X are integers in the range of valid pixel coordinates. I'm trying to use an SVM for this problem. If I run the SVM against the dataset as is, it gives an accuracy of 74%. However, if I scale the data to the range [0, 1], it gives much poorer results: only 69% correct.
I double-checked the histogram of the elements in X against that of its scaled version Xs, and they are identical. So the data is not corrupted, just normalized. Knowing the ideas behind SVMs, I assumed scaling should not affect the results, but it does. So why does this happen?
Here's my code in case I made a mistake in it:
>>> import numpy as np
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.svm import SVC
>>>
>>> X, y = ...
>>> Xs = X.astype(np.float32) / (X.max() - X.min())
>>> cross_val_score(SVC(kernel='linear'), X, y, cv=10).mean()
0.74531073446327667
>>> cross_val_score(SVC(kernel='linear'), Xs, y, cv=10).mean()
0.69485875706214695
Scaling should certainly affect results, but it should improve them. The performance of an SVM, however, is critically dependent on its C setting, which trades off the cost of misclassification on the training set against model simplicity, and which should be determined using, e.g., grid search and nested cross-validation. The default settings are very rarely optimal for any given problem, and the best C depends on the scale of the features: a default C that happens to work acceptably on raw pixel values can be a poor fit for the same data scaled to [0, 1].