How to properly split an imbalanced dataset into train and test sets?

Question

I have a flight delay dataset and am trying to split it into a train set and a test set before sampling. On-time cases make up about 80% of the data and delayed cases about 20%.

Normally in machine learning the ratio of train to test set size is 8:2. But the data is too imbalanced. In the extreme case, most of the train data could be on-time cases and most of the test data delayed cases, and accuracy would be poor.

So my question is: how can I properly split an imbalanced dataset into train and test sets?

Answer 1

Just playing with the train/test ratio probably won't give you correct predictions and results.

If you are working on an imbalanced dataset, you should try a re-sampling technique to get better results. With imbalanced datasets, a classifier can end up "predicting" the most common class almost every time without really using the features.

Also use a different metric for measuring performance, such as F1 score, when the dataset is imbalanced.
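To see why accuracy alone can mislead here, a minimal sketch in plain Python follows. The labels mirror the 80/20 ratio from the question, and the always-"on-time" classifier is a hypothetical stand-in, not a model from this thread.

```python
# 80 on-time (0) and 20 delayed (1) labels, mirroring the 80/20 ratio in the question.
y_true = [0] * 80 + [1] * 20
# Hypothetical classifier that always predicts the majority class ("on-time").
y_pred = [0] * 100

# Confusion-matrix counts for the minority (delayed) class.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
# Guard against division by zero when nothing is predicted as delayed.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f}, F1 (delayed class)={f1:.2f}")
```

Accuracy looks fine because of the majority class, while the F1 score for the delayed class shows that no delay was ever detected.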

Please go through the links below; they will give you more clarity.

What is the correct procedure to split the Data sets for classification problem?

Cleveland heart disease dataset - can't describe the class

Answer 2

Start from 50/50 and keep changing the split to 60/40, 70/30, 80/20, 90/10. Report all the results and draw a conclusion from them. In one of my flight delay prediction projects, I used a 60/40 split and got 86.8% accuracy using an MLP NN.
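A rough sketch of such a ratio sweep is shown below. Everything in it is an assumption made for illustration: the data is synthetic (`make_classification` with an 80/20 class ratio), `LogisticRegression` stands in for the MLP mentioned above, the split is stratified, and F1 is reported instead of accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flight data: roughly 80% class 0, 20% class 1.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)

results = {}
for test_size in (0.5, 0.4, 0.3, 0.2, 0.1):
    # stratify=y keeps the class ratio the same in both subsets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=42
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    results[test_size] = f1_score(y_test, model.predict(X_test))

for test_size, f1 in results.items():
    train_pct = round((1 - test_size) * 100)
    test_pct = round(test_size * 100)
    print(f"{train_pct}/{test_pct} split: F1 = {f1:.3f}")
```

With a single random seed the differences between ratios can be mostly noise, so repeating the sweep over several seeds gives a fairer comparison.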

Answer 3

There are two approaches that you can take.

A simple one: no preprocessing of the dataset, but careful sampling so that both classes are represented in the same proportion in the test and train subsets. You can do it by splitting by class first and then randomly sampling from both sets.

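    import numpy as np
    from sklearn.model_selection import train_test_split

    # Assumes dataX (features) and dataY (labels) are NumPy arrays, and that
    # labels are 0 (on-time) and 1 (delayed); adjust the masks to your encoding.
    maskA = dataY == 0
    maskB = dataY == 1
    XclassA, YclassA = dataX[maskA], dataY[maskA]
    XclassB, YclassB = dataX[maskB], dataY[maskB]

    # Split each class separately with the same test fraction.
    XclassA_train, XclassA_test, YclassA_train, YclassA_test = train_test_split(XclassA, YclassA, test_size=0.2, random_state=42)
    XclassB_train, XclassB_test, YclassB_train, YclassB_test = train_test_split(XclassB, YclassB, test_size=0.2, random_state=42)

    # Concatenate the per-class pieces (using + on arrays would add them element-wise).
    Xclass_train = np.concatenate([XclassA_train, XclassB_train])
    Yclass_train = np.concatenate([YclassA_train, YclassB_train])
    Xclass_test = np.concatenate([XclassA_test, XclassB_test])
    Yclass_test = np.concatenate([YclassA_test, YclassB_test])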
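For reference, scikit-learn's `train_test_split` also accepts a `stratify` argument that performs this per-class split in one call. The sketch below uses made-up data (random features, 800 on-time and 200 delayed labels) only to show that the class ratio is preserved.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 3 features, 80% class 0 (on-time), 20% class 1 (delayed).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 800 + [1] * 200)

# stratify=y makes both subsets keep the original class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("delayed share in train:", y_train.mean())
print("delayed share in test: ", y_test.mean())
```

Both subsets end up with the same 80/20 class ratio as the full dataset, so the extreme case described in the question cannot occur.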

A more involved, and arguably better, one: first try to balance your dataset. For that you can use one of many techniques (under-sampling, over-sampling, SMOTE, ADASYN, Tomek links, etc.). I recommend you review the methods of the imbalanced-learn package. Having done the balancing, you can use an ordinary test/train split with typical methods, without any additional intermediary steps.

The second approach is better not only from the perspective of splitting the data but also in terms of training speed and even the ability to train a model at all (which is not guaranteed to work for heavily imbalanced datasets).
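As an illustration of the simplest balancing technique, here is a minimal NumPy sketch of random over-sampling. The helper `random_oversample` is hypothetical and written for this example; it is not the imbalanced-learn API. This variant is applied to the training portion only, following the "split before sampling" order from the question, which is a common precaution so that duplicated rows cannot land in both the train and the test set.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest class. Hypothetical helper."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing number of rows with replacement from this class.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        idx_parts.append(np.concatenate([cls_idx, extra]))
    # Shuffle so the classes are not grouped together.
    idx = rng.permutation(np.concatenate(idx_parts))
    return X[idx], y[idx]

# Toy training set: 80 on-time (0) and 20 delayed (1) rows.
X_train = np.arange(200).reshape(100, 2)
y_train = np.array([0] * 80 + [1] * 20)

X_bal, y_bal = random_oversample(X_train, y_train)
print(np.bincount(y_bal))
```

The test set is left untouched here, so evaluation still reflects the real class distribution. SMOTE and the other methods listed above replace the plain duplication step with synthetic or cleaned samples.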

3 answers

Answer 1: score 2, accepted, 2019-07-27 23:24:37

Answer 2: score 0, 2019-07-27 08:33:55

Answer 3: score 0, 2020-02-12 13:09:05
