
Are high values for C or gamma problematic when using an RBF kernel SVM?

Original post: 2014-04-30 14:41:46 | Tags: machine-learning, nlp, svm

I'm using WEKA/LibSVM to train a classifier for a term extraction system. My data is not linearly separable, so I used an RBF kernel instead of a linear one.
I followed the guide by Hsu et al. and iterated over several values for both C and gamma. The parameters that worked best for classifying known terms (test and training material differ, of course) are rather high: C = 2^10 and gamma = 2^3.
So far the high parameters seem to work OK, yet I wonder whether they may cause problems further on, especially regarding overfitting. I plan to do another evaluation by extracting new terms, but that is costly because I need human judges.
Could anything still be wrong with my parameters, even if both evaluations turn out positive? Do I perhaps need another kernel type?

Thank you very much!

2 answers

In general, you have to perform cross-validation to find out whether the parameters are fine or whether they lead to overfitting.
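As a rough illustration of what cross-validation does, here is a minimal pure-Python sketch. The names `k_fold_indices`, `cross_validate`, and `train_and_score` are hypothetical and not part of WEKA or LibSVM; `train_and_score` stands in for whatever trains the SVM with a given C and gamma and returns a score on the held-out fold.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled, disjoint folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index gives folds whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, k, train_and_score):
    """Average the held-out score over k folds.

    train_and_score is a hypothetical callable with the signature
    (train_x, train_y, test_x, test_y) -> score.
    """
    scores = []
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        scores.append(train_and_score(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in held_out], [labels[i] for i in held_out],
        ))
    return sum(scores) / len(scores)
```

A large gap between the training score and the averaged held-out score for C = 2^10 and gamma = 2^3 would point towards overfitting. LibSVM's `svm-train` also has a built-in n-fold mode (the `-v` option), which is probably the more convenient route in practice.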

Intuitively, this looks like a highly overfitted model. A high value of gamma means that your Gaussians are very narrow (concentrated around each point), which, combined with a high C value, will result in memorizing most of the training set. If you check the number of support vectors, I would not be surprised if it were 50% of your whole data. Another possible explanation is that you did not scale your data. Most ML methods, especially SVMs, require the data to be properly preprocessed. In particular, this means that you should normalize (standardize) the input data so that it is more or less contained in the unit sphere.
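To make the "narrow Gaussians" point concrete, the following sketch evaluates the RBF kernel, exp(-gamma * ||x - y||^2), for two points one unit apart, and shows a simple z-score standardization. The helper names `rbf_kernel` and `standardize` are illustrative only and are not LibSVM functions.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Similarity between two points one unit apart, for a small and a large gamma.
origin, neighbour = [0.0, 0.0], [1.0, 0.0]
for gamma in (2 ** -3, 2 ** 3):
    print(gamma, rbf_kernel(origin, neighbour, gamma))
```

With gamma = 2^3 the kernel value at distance 1 is exp(-8), roughly 0.0003, so each training point barely influences anything beyond its immediate neighbourhood; with gamma = 2^-3 it is about 0.88. When standardizing, the means and deviations should be computed on the training data and reused unchanged for the test data.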

RBF seems like a reasonable choice, so I would keep using it. A high value of gamma is not necessarily a bad thing; it depends on the scale of your data. While a high C value can lead to overfitting, it is also affected by the scale, so in some cases it might be just fine.
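The dependence of gamma on the data scale can be checked directly: multiplying every feature by a factor s has the same effect on the RBF kernel as multiplying gamma by s^2. The sketch below uses made-up points and a hypothetical rescaling factor of 10 purely for illustration.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


x, y = [0.2, 0.5], [0.4, 0.1]   # made-up points on a small scale
scale = 10.0                     # hypothetical rescaling of every feature
gamma = 2 ** 3

x_scaled = [scale * v for v in x]
y_scaled = [scale * v for v in y]

# The same similarity is obtained on the rescaled data with gamma / scale**2.
on_original = rbf_kernel(x, y, gamma)
on_rescaled = rbf_kernel(x_scaled, y_scaled, gamma / scale ** 2)
print(math.isclose(on_original, on_rescaled))
```

So gamma = 2^3 on features that span a small range describes the same kernel as a much smaller gamma on features that span a range ten times larger, which is why the raw value alone says little about overfitting.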

If you think that your dataset is a good representation of the whole data, then you could use cross-validation to test your parameters and have some peace of mind.