sklearn - model keeps overfitting
I'm looking for recommendations as to the best way forward for my current machine learning problem.
The outline of the problem and what I've done is as follows:
You can find a shortened version of the code here: http://pastebin.com/Xu13ciL4
My issues:
Now, this seems like a classic case of overfitting. However, the overfitting here is unlikely to be caused by a disproportionate number of features relative to samples (32 features, 900 samples). I've tried a number of things to alleviate this problem:
I'm happy to keep thinking about the problem, but at this point I'm looking for a nudge in the right direction. Where might my problem be, and what could I do to solve it?
It's entirely possible that my set of features simply doesn't distinguish between the 2 categories, but I'd like to try some other options before jumping to that conclusion. Furthermore, if my features don't distinguish the classes, that would explain the low test set scores, but how do you get a perfect training set score in that case? Is that possible?
I would first try a grid search over the parameter space, while also using k-fold cross-validation on the training set (and keeping the test set to the side, of course). Then pick the set of parameters that generalizes best across the k-fold cross-validation. I suggest using GridSearchCV with StratifiedKFold (it's already the default strategy for GridSearchCV when passing a classifier as the estimator).
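A minimal sketch of that grid search, assuming an RBF SVM; the parameter grid and the synthetic data here are placeholders, not the asker's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Stand-in data matching the stated shape: 900 samples, 32 features.
X, y = make_classification(n_samples=900, n_features=32, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Illustrative grid; widen or narrow based on what you observe.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}

# StratifiedKFold is already the default for classifiers, but being
# explicit makes the intent clear.
search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      cv=StratifiedKFold(n_splits=5))
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```

Comparing `search.best_score_` (cross-validated) against `search.score(X_test, y_test)` gives a direct read on how badly the chosen parameters still overfit.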
Hypothetically, an SVM with an RBF kernel can perfectly fit any training set, since its VC dimension is infinite. So if tuning the parameters doesn't help reduce overfitting, then you may want to try a similar parameter-tuning strategy for a simpler hypothesis, such as a linear SVM or another classifier you think may be appropriate for your domain.
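One quick way to check whether the simpler hypothesis helps is to cross-validate both side by side; a sketch on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

# Synthetic placeholder for the real 900x32 feature matrix.
X, y = make_classification(n_samples=900, n_features=32, random_state=0)

# RBF SVM (infinite-capacity hypothesis) vs. linear SVM (simpler hypothesis).
rbf_scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)
linear_scores = cross_val_score(LinearSVC(C=1.0, max_iter=10000), X, y, cv=5)
print(rbf_scores.mean(), linear_scores.mean())
```

If the linear model matches or beats the RBF model on held-out folds, the extra capacity of the RBF kernel is likely only being spent on memorizing the training set.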
Regularization, as you mentioned, is definitely a good idea if it's available.
The prediction of the same label makes me think that label imbalance may be an issue, and in that case you could use different class weights. So in the case of an SVM, each class gets its own C penalty weight. Some estimators in sklearn also accept fit params that allow you to set sample weights, setting the amount of penalty for individual training samples.
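A sketch of both options on an artificially imbalanced dataset; the 9:1 imbalance and the weight values are illustrative assumptions, not from the question:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data with a deliberate 9:1 class imbalance.
X, y = make_classification(n_samples=900, n_features=32,
                           weights=[0.9, 0.1], random_state=0)

# Option 1: per-class C weights. "balanced" scales C inversely to class
# frequency; an explicit dict like {0: 1, 1: 9} works too.
clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)

# Option 2: per-sample weights via the sample_weight fit param,
# up-weighting the minority class here.
sample_weight = np.where(y == 1, 9.0, 1.0)
clf2 = SVC(kernel="rbf").fit(X, y, sample_weight=sample_weight)
print(clf.score(X, y), clf2.score(X, y))
```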
Now, if you think the features may be an issue, I would use feature selection by looking at the F-values provided by f_classif, which can be used with something like SelectKBest. Another option would be recursive feature elimination with cross-validation (RFECV). Feature selection can be wrapped into the grid search as well if you use sklearn's Pipeline API.
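A sketch of wrapping univariate feature selection into the grid search via a Pipeline, so the number of kept features is tuned alongside the SVM parameters. The step names "select" and "svm", the candidate values of k, and the data are all placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic placeholder for the real 900x32 feature matrix.
X, y = make_classification(n_samples=900, n_features=32, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),  # F-value-based selection
    ("svm", SVC(kernel="rbf")),
])

# Pipeline parameters use the "<step>__<param>" naming convention.
param_grid = {"select__k": [8, 16, 32], "svm__C": [1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_)
```

Doing the selection inside the Pipeline matters: it ensures each cross-validation fold re-fits the selector on its own training split, rather than leaking information from the held-out fold.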