
Are high values for C or gamma problematic when using an RBF kernel SVM?

I'm using WEKA/LibSVM to train a classifier for a term extraction system. My data is not linearly separable, so I used an RBF kernel instead of a linear one.
I followed the guide from Hsu et al. and iterated over several values for both C and gamma. The parameters that worked best for classifying known terms (the test and training material differ, of course) are rather high: C = 2^10 and gamma = 2^3.
So far the high parameters seem to work OK, yet I wonder whether they may cause problems later on, especially regarding overfitting. I plan to do another evaluation by extracting new terms, but that is costly because I need human judges.
Could anything still be wrong with my parameters even if both evaluations turn out positive? Do I perhaps need another kernel type?

Thank you very much!

In general, you have to perform cross-validation to find out whether the parameters are fine or whether they lead to overfitting.
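As a rough illustration of what cross-validation does, here is a minimal k-fold sketch in plain Python. The names `k_fold_indices`, `cross_validate`, and `train_and_score` are hypothetical; the callback stands in for whatever trains the SVM on the training indices and returns its accuracy on the held-out indices (in WEKA you would normally use the built-in cross-validation option instead).

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled, nearly equal-sized folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, k, train_and_score):
    """Average train_and_score(train_idx, test_idx) over k folds."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)

# Placeholder scorer (assumption): a real one would train the SVM on
# train_idx and return the accuracy measured on test_idx.
def dummy_score(train_idx, test_idx):
    return len(test_idx) / (len(train_idx) + len(test_idx))

print(cross_validate(10, 5, dummy_score))
```

If the cross-validated score is clearly lower than the score on the training data, that gap is the usual sign of overfitting.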

From an intuition perspective, this looks like a highly overfitted model. A high gamma means that your Gaussians are very narrow (concentrated around each point), which, combined with a high C, will result in memorizing most of the training set. If you check the number of support vectors, I would not be surprised if it were 50% of your whole data set. Another possible explanation is that you did not scale your data. Most ML methods, especially SVMs, require the data to be properly preprocessed. In particular, this means you should normalize (standardize) the input data so that it is more or less contained in the unit sphere.
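A minimal sketch of the standardization step, assuming the data is a list of numeric feature rows. The helper names `fit_scaler` and `transform` are made up for this example; the important part is that the means and standard deviations come from the training data only and are then reused for the test data.

```python
import statistics

def fit_scaler(rows):
    """Compute per-feature mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # A constant feature has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Shift and rescale each feature to zero mean and unit variance."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Toy data (assumption): the second feature is on a 100x larger scale than the first.
train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
means, stds = fit_scaler(train)
print(transform(train, means, stds))
```

After the transform both features cover the same range, so no single feature dominates the distances that the RBF kernel is computed from.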

RBF seems like a reasonable choice, so I would keep using it. A high value of gamma is not necessarily a bad thing; it depends on the scale of your data. While a high C value can lead to overfitting, it is also affected by the scale, so in some cases it might be just fine.
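To make the scale dependence concrete, here is a small sketch using the RBF kernel in the form LibSVM uses, exp(-gamma * ||x - y||^2). The two points and the scale factor `s` are arbitrary toy values chosen for illustration.

```python
import math

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

x, y = [0.0, 0.0], [0.6, 0.8]        # Euclidean distance of 1.0
print(rbf_kernel(x, y, 2 ** 3))      # about 3.4e-4: points one unit apart barely interact

# Multiplying every feature by s has the same effect as multiplying gamma by s**2.
s = 0.1
x_scaled = [s * v for v in x]
y_scaled = [s * v for v in y]
print(rbf_kernel(x_scaled, y_scaled, 2 ** 3))   # about 0.92
print(rbf_kernel(x, y, 2 ** 3 * s ** 2))        # same value as the line above
```

So gamma = 2^3 is very narrow for features spread over several units, but quite smooth for features that only vary by a tenth of a unit.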

If you think your data set is a good representation of the whole data, then you could use cross-validation to test your parameters and have some peace of mind.
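One way to combine this with the parameter search is to score every (C, gamma) pair by cross-validation and keep the best one. The sketch below assumes a hypothetical callback `cv_score(C, gamma)` that returns the cross-validated accuracy for one setting; the exponent ranges are the coarse grid suggested in the Hsu et al. guide, and `toy_cv_score` is a made-up stand-in so that the example runs without an SVM library.

```python
import math

# Coarse exponent grid as suggested in the Hsu et al. guide:
# C = 2^-5, 2^-3, ..., 2^15 and gamma = 2^-15, 2^-13, ..., 2^3.
C_EXPONENTS = list(range(-5, 16, 2))
GAMMA_EXPONENTS = list(range(-15, 4, 2))

def grid_search(c_exponents, gamma_exponents, cv_score):
    """Return (best_score, C, gamma) for the pair with the highest score."""
    best = None
    for c_exp in c_exponents:
        for g_exp in gamma_exponents:
            C, gamma = 2.0 ** c_exp, 2.0 ** g_exp
            score = cv_score(C, gamma)
            if best is None or score > best[0]:
                best = (score, C, gamma)
    return best

# Toy scorer (assumption): peaks at C = 2^9, gamma = 2^-1. A real scorer
# would run cross-validation with the SVM for the given parameters.
def toy_cv_score(C, gamma):
    return -abs(math.log2(C) - 9) - abs(math.log2(gamma) + 1)

best_score, best_C, best_gamma = grid_search(C_EXPONENTS, GAMMA_EXPONENTS, toy_cv_score)
print(best_C, best_gamma)
```

If the best pair sits on the edge of the range that was searched, it is usually worth extending the range in that direction before trusting the result.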
