Is there a way to define the fraction of each label I want in scikit-learn cross-validation?

Original post: 2020-05-28 13:31:28 · Tags: python, scikit-learn, cross-validation

I've written a simple Python script that uses sklearn.neural_network.MLPClassifier and sklearn.model_selection.GridSearchCV to make predictions on binary classification data, where each point is labelled either 0 or 1. In the training data, roughly 90% of the points have the label 1 and 10% have the label 0. In the test data, roughly 35% have the label 1 and 65% have the label 0. This proportion is known, although the individual test labels aren't.

My model is currently overfitting. My cross-validation score on the training data is 85-90%, but the score when I run the code on the test set is below 40%.

One workaround I've thought of is to have GridSearchCV split the data so that each training/validation set has approximately the same proportion of labels as the test data. That doesn't seem to be an option in this library, however, and my google-fu hasn't turned up anything among the other scikit-learn tools.

Are there any other libraries I could use, or a parameter I haven't managed to find? Thank you.
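On the splitting idea above: the cv parameter of GridSearchCV accepts an iterable of (train, validation) index arrays, so custom folds can be built by hand. Below is a minimal sketch of that approach, not a built-in scikit-learn feature. The function name proportioned_folds and its parameters are hypothetical, it assumes binary labels 0/1, and it only rebalances the validation part of each fold by dropping surplus samples; the training part keeps the original label mix.

```python
import numpy as np


def proportioned_folds(y, n_splits=5, frac_label_1=0.35, seed=0):
    """Yield (train_idx, val_idx) pairs whose validation part contains
    roughly `frac_label_1` samples with label 1 (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    # Shuffle each class separately and cut it into n_splits chunks.
    chunks_0 = np.array_split(rng.permutation(np.flatnonzero(y == 0)), n_splits)
    chunks_1 = np.array_split(rng.permutation(np.flatnonzero(y == 1)), n_splits)
    for c0, c1 in zip(chunks_0, chunks_1):
        # Largest validation size with the requested mix that this chunk allows.
        n_val = int(min(len(c0) / (1 - frac_label_1), len(c1) / frac_label_1))
        n1 = int(round(n_val * frac_label_1))
        n0 = n_val - n1
        val_idx = np.concatenate([c0[:n0], c1[:n1]])
        # Everything in this fold's chunks is kept out of training,
        # including the surplus samples that were dropped from validation.
        held_out = np.concatenate([c0, c1])
        train_idx = np.setdiff1d(np.arange(len(y)), held_out)
        yield train_idx, val_idx
```

The folds could then be passed as cv=list(proportioned_folds(y_train)) when constructing GridSearchCV. With a 90/10 training set, most label-1 samples in each validation chunk are discarded, so the validation sets end up small.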

1 Answer

I would suggest the imblearn library, as it offers a great variety of re-sampling methods. I do not know the size or other specifics of your data set, but in general I would argue that oversampling strategies should be favored over undersampling ones. You could, for example, use SMOTE to oversample the points labelled 0 in the training set. The sampling_strategy parameter also allows you to specify your desired ratio beforehand.
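As a rough sketch of the SMOTE suggestion, the snippet below works out how many label-0 samples are needed for label 0 to make up about 65% of the training set, then passes that count to SMOTE as a dict. The helper names oversampling_targets and oversample_with_smote are hypothetical. The dict form is used because, as far as I know, a float sampling_strategy is restricted to minority/majority ratios of at most 1, which cannot make label 0 the larger class.

```python
from collections import Counter


def oversampling_targets(y, minority_label=0, minority_fraction=0.65):
    """Return {minority_label: n} so that, once the minority class has n
    samples, it makes up about `minority_fraction` of the data."""
    counts = Counter(y)
    n_other = sum(n for label, n in counts.items() if label != minority_label)
    n_target = round(n_other * minority_fraction / (1 - minority_fraction))
    # Oversampling can only add samples, never remove them.
    return {minority_label: max(n_target, counts[minority_label])}


def oversample_with_smote(X, y, minority_label=0, minority_fraction=0.65, seed=0):
    # Imported lazily so the helper above works without imbalanced-learn.
    from imblearn.over_sampling import SMOTE

    strategy = oversampling_targets(y, minority_label, minority_fraction)
    smote = SMOTE(sampling_strategy=strategy, random_state=seed)
    # Returns the resampled (X, y) pair.
    return smote.fit_resample(X, y)
```

Only the training portion should be resampled. Synthetic points in the validation or test data would make the scores hard to interpret.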