
How to choose parameters for svm in sklearn

I'm trying to use SVM from sklearn for a classification problem. I have a highly sparse dataset with more than 50K rows and binary outputs.
The problem is that I don't know how to choose the parameters efficiently, mainly the kernel, gamma, and C.

For the kernels, for example, am I supposed to try all of them and just keep the one that gives the most satisfying results, or is there something about the data that we can look at in the first place, before choosing the kernel?
The same goes for C and gamma.

Thanks!

Yes, this is mostly a matter of experimentation -- especially as you've told us very little about your data set: separability, linearity, density, connectivity, ... all the characteristics that affect classification algorithms.

Try the linear and Gaussian (RBF) kernels for starters. If linear doesn't work well and Gaussian does, then try the other kernels.
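A minimal sketch of that comparison (synthetic data stands in for your sparse 50K-row set, which I don't have access to): cross-validate the linear and RBF kernels side by side before touching any other parameter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in binary classification data; substitute your own sparse matrix.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Compare linear vs. Gaussian (RBF) kernels with 5-fold cross-validation.
for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, scores.mean())
```

For a dataset as large and sparse as yours, `LinearSVC` is usually much faster than `SVC(kernel="linear")` and is worth trying first.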

Once you've found the best 1 or 2 kernels, then play with the cost (C) and gamma parameters. C is the "slack" parameter: it controls how many raw classification errors the model tolerates as a trade-off for other benefits: the width of the margin, the simplicity of the decision function, etc. Gamma sets the width of the Gaussian kernel, i.e. how far the influence of a single training example reaches.
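Playing with C and gamma is typically done with a grid search. A hedged sketch (the grid values below are illustrative starting points, not recommendations for your data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data; substitute your own.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],          # slack/cost: tolerance for misclassified points
    "gamma": [0.001, 0.01, 0.1, 1],  # RBF kernel width
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Values for C and gamma are usually searched on a logarithmic scale, as above; refine the grid around the best cell once you've found it.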

I haven't yet had an application that got more than trivial benefit from altering the cost.

