
Train/Validation/Testing sets for imbalanced dataset

I am working on an NLP classification task. My dataset is imbalanced, and some authors have only one text, so I want those texts to appear only in the training set. For the other authors I need a 70% / 15% / 15% split into training, validation and test sets respectively.

I tried to use the train_test_split function from sklearn, but the results aren't good.

My dataset is a dataframe that looks like this:

Title  Preprocessed_Text  Label


Please let me know.

With only one sample of a particular class, it seems impossible to measure classification performance on that class. So I recommend using one or more oversampling approaches to overcome the imbalance problem ([a hands-on article on it][1]). You must also pay attention to splitting the data in a way that preserves the prior probability of each class (for example, by setting the stratify argument in train_test_split). In addition, there are some considerations about the scoring method you must take into account (for example, accuracy is not the best fit for scoring an imbalanced problem).
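As a minimal sketch of what naive random oversampling looks like (the helper name `oversample` is hypothetical, not from a library; dedicated tools such as imbalanced-learn offer more sophisticated variants), applied to the training portion only:

```python
import random
from collections import Counter

def oversample(texts, labels, seed=0):
    """Naive random oversampling: duplicate minority-class samples at
    random until every class matches the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        out_texts.extend(items)
        out_labels.extend([label] * len(items))
        extra = target - len(items)          # how many duplicates are needed
        out_texts.extend(rng.choices(items, k=extra))
        out_labels.extend([label] * extra)
    return out_texts, out_labels

# toy example: class "x" has 3 texts, class "y" has 2
texts = ["t1", "t2", "t3", "t4", "t5"]
labels = ["x", "x", "x", "y", "y"]
bal_texts, bal_labels = oversample(texts, labels)
print(Counter(bal_labels))  # each class now has 3 samples
```

Note that oversampling should be applied after splitting, and only to the training set, so that duplicated samples never leak into validation or test data.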

It is rather hard to obtain good classification results for a class that contains only one instance (at least for that specific class). Regardless, for imbalanced datasets you should use a stratified train_test_split, which preserves in each split the same class proportions observed in the original dataset.

from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
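Since the question also asks to keep single-text authors in the training set only, one way to sketch that (the function name `split_with_singletons` and the demo frame are hypothetical; it assumes a pandas DataFrame with a Label column, as in the question) is to set the singleton classes aside, stratified-split the rest 70/15/15, and then add the singletons back to the training set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_with_singletons(df, label_col="Label", seed=42):
    """Classes with a single sample go to train only; the rest is
    split 70/15/15 with stratification."""
    counts = df[label_col].value_counts()
    singles = df[df[label_col].isin(counts[counts == 1].index)]
    rest = df[df[label_col].isin(counts[counts > 1].index)]
    # first split off 30%, then halve it into validation and test
    train, temp = train_test_split(
        rest, test_size=0.30, stratify=rest[label_col], random_state=seed)
    val, test = train_test_split(
        temp, test_size=0.50, stratify=temp[label_col], random_state=seed)
    train = pd.concat([train, singles])
    return train, val, test

# hypothetical demo frame: two balanced classes plus one singleton author
df = pd.DataFrame({
    "Preprocessed_Text": [f"text {i}" for i in range(41)],
    "Label": ["a"] * 20 + ["b"] * 20 + ["solo"],
})
train, val, test = split_with_singletons(df)
print(len(train), len(val), len(test))  # 29 6 6
```

The second stratified split requires every remaining class to have at least two samples in the 30% slice, so very small classes (2-3 texts) may still need special handling.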

I should also add that if the dataset is rather small, say no more than 100 instances, it is preferable to use cross-validation instead of a single train_test_split, and more specifically StratifiedKFold, which returns stratified folds.

When it comes to evaluation, you should consider metrics such as precision, recall and F1-score (the harmonic mean of precision and recall), using the weighted average for each of them, which weights each class by its number of true instances. As per the documentation:

'weighted':

Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall.
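For example, on a small made-up prediction vector with three classes of unequal support:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# toy labels: class 0 has support 4, class 1 has 2, class 2 has 1
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 0, 1, 1, 2]

# average="weighted" averages the per-class scores,
# weighting each class by its number of true instances
p = precision_score(y_true, y_pred, average="weighted")
r = recall_score(y_true, y_pred, average="weighted")
f = f1_score(y_true, y_pred, average="weighted")
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Because the majority class dominates the weights, a classifier that only does well on frequent classes is penalized less here than under average="macro", which treats all classes equally; for severe imbalance it is worth reporting both.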
