I have a multiclass classification problem. My dataset (call the data X and the labels y) represents sets of points on 640x480 images, so all elements of X are integers in the range of valid pixel coordinates. I'm trying to use an SVM for this problem. If I run the SVM against the dataset as is, it gives an accuracy of 74%. However, if I scale the data to the range [0, 1], it gives much poorer results: only 69% correct.
I double-checked the histogram of the elements in X against that of its scaled version Xs, and they are identical. So the data is not corrupted, just normalized. Knowing the ideas behind SVMs, I assumed scaling should not affect the results, but it does. So why does this happen?
Here's my code in case I made a mistake in it:
>>> import numpy as np
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.svm import SVC
>>>
>>> X, y = ...
>>> Xs = X.astype(np.float32) / (X.max() - X.min())
>>> cross_val_score(SVC(kernel='linear'), X, y, cv=10).mean()
0.74531073446327667
>>> cross_val_score(SVC(kernel='linear'), Xs, y, cv=10).mean()
0.69485875706214695
Scaling should certainly affect results, but it should improve them. The performance of an SVM, however, is critically dependent on its C setting, which trades off the cost of misclassification on the training set against model simplicity, and which should be determined using, e.g., grid search and nested cross-validation. The default settings are very rarely optimal for any given problem, and the best C depends on the scale of the features: a default C that happens to work acceptably on raw pixel values can be a poor fit for the same data scaled to [0, 1].