
sklearn - model keeps overfitting

I'm looking for recommendations as to the best way forward for my current machine learning problem.

The outline of the problem and what I've done is as follows:

  • I have 900+ trials of EEG data, where each trial is 1 second long. The ground truth is known for each trial and classifies it as state 0 or state 1 (40-60% split).
  • Each trial goes through preprocessing, where I filter and extract the power of certain frequency bands, and these make up a set of features (feature matrix: 913x32).
  • Then I use sklearn to train the model. cross_validation is used with a test size of 0.2. The classifier is an SVC with an rbf kernel, C = 1, gamma = 1 (I've tried a number of different values).

You can find a shortened version of the code here: http://pastebin.com/Xu13ciL4
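For reference, here is a minimal, self-contained sketch of the setup described above. The data is random placeholder data standing in for the real 913x32 EEG feature matrix, the variable names are my own, and the imports use the modern `sklearn.model_selection` module rather than the older `cross_validation` module:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(913, 32)        # placeholder for the EEG band-power features
y = rng.randint(0, 2, 913)    # placeholder for the state 0/1 ground truth

# Hold out 20% of the trials for testing, as in the question
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="rbf", C=1, gamma=1)
clf.fit(X_train, y_train)

# With gamma this large, the rbf kernel memorizes the training set:
# train accuracy is ~1.0 while test accuracy stays near chance.
print("train:", clf.score(X_train, y_train))
print("test:", clf.score(X_test, y_test))
```

Even on pure noise this reproduces the symptom in the question: a near-perfect training score with a chance-level test score, which is the signature of an overly flexible kernel.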

My issues:

  • When I use the classifier to predict labels for my test set, every prediction is 0.
  • Train accuracy is 1, while test set accuracy is around 0.56.
  • My learning curve plot looks like this:

[learning curve plot]

Now, this seems like a classic case of overfitting. However, the overfitting is unlikely to be caused by a disproportionate number of features relative to samples (32 features, 900 samples). I've tried a number of things to alleviate the problem:

  • I've tried dimensionality reduction (PCA) in case I have too many features for the number of samples, but the accuracy scores and learning curve plot look the same as above, unless I set the number of components below 10, at which point train accuracy begins to drop. But isn't that somewhat expected, given that information is being lost?
  • I've tried normalizing and standardizing the data. Standardizing (SD = 1) does nothing to change the train or test accuracy scores. Normalizing (0-1) drops my training accuracy to 0.6.
  • I've tried a variety of C and gamma settings for SVC, but they don't change either score.
  • I've tried other estimators like GaussianNB, and even ensemble methods like AdaBoost. No change.
  • I've tried explicitly setting a regularization method using LinearSVC, but it didn't improve the situation.
  • I tried running the same features through a neural net using Theano; my train accuracy is around 0.6 and test accuracy around 0.5.

I'm happy to keep thinking about the problem, but at this point I'm looking for a nudge in the right direction. Where might my problem be, and what could I do to solve it?

It's entirely possible that my set of features just doesn't distinguish between the 2 categories, but I'd like to try some other options before jumping to that conclusion. Furthermore, if my features don't distinguish the classes, that would explain the low test set scores; but how do you get a perfect training set score in that case? Is that possible?

I would first try a grid search over the parameter space, while also using k-fold cross-validation on the training set (and keeping the test set to the side, of course). Then pick the set of parameters that generalizes best from the k-fold cross-validation. I suggest using GridSearchCV with StratifiedKFold (it's already the default strategy for GridSearchCV when passing a classifier as the estimator).
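A sketch of that strategy, again with placeholder data and parameter ranges of my own choosing (the grid values are illustrative, not recommendations for this dataset):

```python
import numpy as np
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = rng.randn(913, 32), rng.randint(0, 2, 913)

# Keep a final test set aside; stratify to preserve the 40-60 class split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Search C and gamma on a coarse log scale
param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [1e-3, 1e-2, 1e-1, 1]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("best CV score:", search.best_score_)
print("held-out test score:", search.score(X_test, y_test))
```

The key point is that model selection happens entirely inside the cross-validation on the training set; the held-out test set is touched only once, at the end.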

Hypothetically, an SVM with an rbf kernel can perfectly fit any training set, as its VC dimension is infinite. So if tuning the parameters doesn't help reduce overfitting, you may want to try a similar parameter-tuning strategy for a simpler hypothesis, such as a linear SVM or another classifier you think may be appropriate for your domain.

Regularization, as you mentioned, is definitely a good idea if it's available.

The prediction of the same label for every sample makes me think that label imbalance may be an issue, and in that case you could use different class weights. For an SVM, each class then gets its own C penalty weight. Some estimators in sklearn also accept fit params that let you set sample weights, controlling the amount of penalty for individual training samples.
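Both ideas can be sketched as follows; the weights here are arbitrary examples, not tuned values, and the data is again a random stand-in with a 60/40 class split:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(900, 32)
y = (rng.rand(900) < 0.4).astype(int)   # roughly 60/40 class split

# Per-class C weighting: 'balanced' scales C inversely to class frequency,
# or pass an explicit dict such as {0: 1.0, 1: 1.5}
clf = SVC(kernel="rbf", class_weight="balanced")
clf.fit(X, y)

# Per-sample weighting via the fit param instead: here the minority
# class samples get double the penalty for being misclassified
clf2 = SVC(kernel="rbf")
clf2.fit(X, y, sample_weight=np.where(y == 1, 2.0, 1.0))
```

With `class_weight="balanced"`, a classifier that was collapsing to the majority class is pushed toward paying equal attention to both classes.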

Now, if you think the features may be an issue, I would use feature selection by looking at the F-values provided by f_classif, which can be used with something like SelectKBest. Another option would be recursive feature elimination with cross-validation. Feature selection can be wrapped into a grid search as well if you use sklearn's Pipeline API.
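A sketch of wrapping SelectKBest into a pipeline so that the number of kept features is itself grid-searched (placeholder data; the candidate values for `k` and `C` are illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = rng.randn(913, 32), rng.randint(0, 2, 913)

# Univariate F-test selection feeding into the SVC
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("svc", SVC(kernel="rbf")),
])

# Pipeline step parameters are addressed as <step>__<param>
grid = {"select__k": [5, 10, 20, 32],
        "svc__C": [0.1, 1, 10]}
search = GridSearchCV(pipe, grid, cv=5)
search.fit(X, y)
print("best params:", search.best_params_)
```

Because the selector sits inside the pipeline, it is refit on each CV training fold, so the feature selection itself is cross-validated rather than leaking information from the validation folds.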
