
Optimal SVM parameters for high recall

I'm using scikit-learn to perform binary classification with an SVM.

0 : Does not belong to class A
1 : Belongs to class A

Now, I want to optimize the parameters so that I get high recall. I don't mind a few false positives, but objects that belong to class A should rarely be labelled as not belonging to A (i.e. false negatives should be rare).

I use an SVM with a linear kernel.

from sklearn import svm

clf = svm.SVC(kernel='linear')
clf.fit(X, Y)
clf.predict(...)

How should I choose other SVM parameters like C? Also, what is the difference between SVC with a linear kernel and LinearSVC?

The choice of kernel really depends on the data, so picking the kernel based on a plot of the data might be the way to go. This could be automated by looping over the kernel types and picking the one that gives you the best recall (or whichever metric you're after), as in the sketch below. You can also see the visual difference between the kernels for yourself.
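If you want to automate that comparison, here is a minimal sketch, assuming X and y already hold your features and binary labels, that loops over the built-in kernels and keeps the one with the best cross-validated recall:

from sklearn import svm
from sklearn.model_selection import cross_val_score

kernels = ['linear', 'poly', 'rbf', 'sigmoid']
scores = {}
for kernel in kernels:
    clf = svm.SVC(kernel=kernel)
    # scoring='recall' scores the positive class (label 1, "belongs to A")
    scores[kernel] = cross_val_score(clf, X, y, cv=5, scoring='recall').mean()

best_kernel = max(scores, key=scores.get)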

Depending on the kernel, different arguments of the SVC constructor matter, but in general C is probably the most influential, as it is the penalty for getting a training point wrong. Decreasing C loosens the margin, which can increase recall.
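As a quick check of that effect, a small sketch (same hypothetical X and y as above) that sweeps C and reports the cross-validated recall:

from sklearn import svm
from sklearn.model_selection import cross_val_score

for C in [0.01, 0.1, 1, 10, 100]:
    clf = svm.SVC(kernel='linear', C=C)
    recall = cross_val_score(clf, X, y, cv=5, scoring='recall').mean()
    print('C=%s: recall=%.3f' % (C, recall))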

Other than that, there are more ways to get a better fit, for example by adding more features (extra columns, i.e. a larger n_features) to the X matrix passed to clf.fit(X, y).
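For example, a hypothetical derived feature can be appended as an extra column before fitting:

import numpy as np

# any derived quantity you believe is informative (hypothetical example)
new_feature = X[:, 0] * X[:, 1]
X_extended = np.column_stack([X, new_feature])  # n_features grows by one
clf.fit(X_extended, y)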

And of course it can always be useful to plot the precision/recall curve to get a better feel for what the parameters are doing.
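A minimal sketch of such a plot, assuming a held-out X_test and y_test:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# SVC's decision_function gives a continuous score to threshold on
scores = clf.decision_function(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, scores)

plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()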

Generally speaking, you can tackle this problem by penalizing the two types of errors differently during the learning procedure. If you take a look at the loss function, in particular in the primal/parametric setting, you can think of scaling the penalty of false negatives by alpha and the penalty of false positives by (1 - alpha), where alpha is in [0, 1]. (A similar effect could be achieved by duplicating the positive instances in your training set, but that makes the problem unnecessarily large and should be avoided for efficiency.)
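In scikit-learn this kind of asymmetric penalty is exposed through the class_weight parameter of SVC, which rescales C per class. A sketch with a hypothetical alpha:

from sklearn import svm

alpha = 0.8  # hypothetical knob in [0, 1]; a higher weight on class 1
# makes missing a positive (a false negative) more expensive
clf = svm.SVC(kernel='linear', class_weight={0: 1 - alpha, 1: alpha})
clf.fit(X, y)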

You can choose the SVM parameter C, which is essentially your penalty term, by cross-validation; K-fold cross-validation works well here. You can also use the sklearn class GridSearchCV, to which you pass your model and a parameter grid, and which performs cross-validation for you via its cv parameter.
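For example, a minimal sketch that tunes C with 5-fold cross-validation, optimizing recall directly rather than accuracy (X and y are assumed to hold your data):

from sklearn import svm
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(svm.SVC(kernel='linear'), param_grid,
                      scoring='recall', cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)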

According to the LinearSVC documentation: "Similar to SVC with parameter kernel='linear', but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples."
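So LinearSVC can be used as a near drop-in replacement, e.g.:

from sklearn.svm import LinearSVC

clf = LinearSVC(C=1.0)  # liblinear backend; scales better with many samples
clf.fit(X, y)
clf.predict(X)

Note that the two are not exactly equivalent: LinearSVC uses the squared hinge loss by default and also regularizes the intercept, so its results can differ slightly from SVC(kernel='linear').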
