
Is there a way to define the fraction of each label I want in scikit-learn cross-validation?

I've written a simple Python script that uses sklearn.neural_network.MLPClassifier and sklearn.model_selection.GridSearchCV to make predictions on binary classification data, with each point labelled either 0 or 1. In the training data, roughly 90% of points have the label 1 and 10% have the label 0. In the test data, roughly 35% have the label 1 and 65% have the label 0. These proportions are known, although the individual test labels aren't.

My model is currently over-fitting. My cross-validation score on the training data is 85-90%, but the score when I run the code on the test set is below 40%.

One workaround I've thought of is setting up GridSearchCV to split the data so that each training/validation set has approximately the same proportion of labels as the test data. This doesn't seem to be an option in this library, however, and my searching hasn't turned up anything in other scikit-learn tools.
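A minimal sketch of the workaround described above, under stated assumptions: the helper name target_ratio_splits and its parameters are hypothetical, labels are taken to be exactly 0 and 1, and only the validation folds (not the training folds) are subsampled to the test proportion. It relies on the fact that the cv argument of GridSearchCV can be an iterable of (train, test) index pairs rather than a fold count.

```python
import random


def target_ratio_splits(y, n_splits=5, frac_label_1=0.35, seed=0):
    """Yield (train_idx, val_idx) pairs of index lists.

    Hypothetical helper: each validation fold keeps all of its label-0
    points and only as many label-1 points as needed for roughly
    `frac_label_1` of the fold to carry label 1.
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for i, label in enumerate(y):
        by_label[int(label)].append(i)
    for indices in by_label.values():
        rng.shuffle(indices)

    for k in range(n_splits):
        # Every n_splits-th shuffled index of each class forms fold k.
        val0 = by_label[0][k::n_splits]
        val1_pool = by_label[1][k::n_splits]
        # Number of label-1 points that gives the desired fraction.
        n1 = round(len(val0) * frac_label_1 / (1 - frac_label_1))
        val1 = val1_pool[:min(n1, len(val1_pool))]
        # The whole fold is held out of training, including the unused
        # label-1 points, so nothing leaks between train and validation.
        held_out = set(val0) | set(val1_pool)
        train = [i for i in range(len(y)) if i not in held_out]
        yield train, sorted(val0 + val1)
```

The splits could then be passed as cv=list(target_ratio_splits(y_train)) to GridSearchCV; converting each index list to a NumPy array first may be the safer choice. Whether scoring on such folds actually closes the gap to the test score is not something this sketch establishes.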

Are there any other libraries I could use, or a parameter I haven't managed to find? Thank you.

I would suggest the imblearn library (imbalanced-learn), as it offers a wide variety of re-sampling methods. I don't know the size or other specifics of your data set, but in general I would argue that oversampling strategies should be favored over undersampling ones. You could, for example, use SMOTE to oversample the points labelled 0 in the training set. The sampling_strategy parameter also lets you specify the desired ratio beforehand.
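As a rough sketch of how that could look, assuming the 65% / 35% test proportions from the question: the helper names smote_targets and oversample are hypothetical, and the dict form of sampling_strategy is used, in which each value is the desired number of samples for that class after resampling.

```python
from collections import Counter


def smote_targets(y, frac_label_0=0.65):
    """Return {0: n}, the label-0 count needed for label 0 to make up
    `frac_label_0` of the resampled set, leaving label 1 untouched.

    Hypothetical helper; assumes labels are exactly 0 and 1.
    """
    counts = Counter(int(label) for label in y)
    n0_target = round(counts[1] * frac_label_0 / (1 - frac_label_0))
    # SMOTE only adds samples, so never ask for fewer than already exist.
    return {0: max(n0_target, counts[0])}


def oversample(X, y, frac_label_0=0.65, seed=0):
    """Oversample label 0 with SMOTE up to the requested fraction."""
    # Imported here so smote_targets works without imbalanced-learn installed.
    from imblearn.over_sampling import SMOTE

    smote = SMOTE(
        sampling_strategy=smote_targets(y, frac_label_0),
        random_state=seed,
    )
    return smote.fit_resample(X, y)
```

With 900 points labelled 1 and 100 labelled 0, this asks SMOTE to bring label 0 up to 1671 samples. The resampling should be applied to training folds only; placing SMOTE in an imblearn.pipeline.Pipeline ahead of the classifier is one way to keep it out of the validation folds during a grid search.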
