
Train/Validation/Testing sets for imbalanced dataset

I am working on an NLP classification task. My dataset is imbalanced, and some authors have only one text, so I want those texts to appear only in the training set. For the other authors I need a 70% / 15% / 15% split into training, validation and test sets respectively.

I tried to use the train_test_split function from sklearn, but the results aren't good.

My dataset is a dataframe that looks like this:

Title  Preprocessed_Text  Label


Please let me know.

With only one sample of a particular class, it seems impossible to measure classification performance on that class. So I recommend using one or more oversampling approaches to overcome the imbalance problem ([a hands-on article on it][1]). You must also pay attention to splitting the data in a way that preserves the prior probability of each class (for example, by setting the stratify argument in train_test_split). In addition, there are some considerations about the scoring method you must take into account (for example, accuracy is not the best fit for scoring an imbalanced problem).
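As a minimal sketch of what naive random oversampling looks like (the helper name `oversample` is hypothetical, not from a library; dedicated tools such as imbalanced-learn offer more sophisticated variants), applied to the training portion only:

```python
import random
from collections import Counter

def oversample(texts, labels, seed=0):
    """Naive random oversampling: duplicate minority-class samples at
    random until every class matches the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        out_texts.extend(items)
        out_labels.extend([label] * len(items))
        extra = target - len(items)          # how many duplicates are needed
        out_texts.extend(rng.choices(items, k=extra))
        out_labels.extend([label] * extra)
    return out_texts, out_labels

# toy example: class "x" has 3 texts, class "y" has 2
texts = ["t1", "t2", "t3", "t4", "t5"]
labels = ["x", "x", "x", "y", "y"]
bal_texts, bal_labels = oversample(texts, labels)
print(Counter(bal_labels))  # each class now has 3 samples
```

Note that oversampling should be applied after splitting, and only to the training set, so that duplicated samples never leak into validation or test data.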

It is rather hard to obtain good classification results for a class that contains only one instance (at least for that specific class). Regardless, for imbalanced datasets you should use a stratified train_test_split, which preserves in each split the same class proportions observed in the original dataset.

from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
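Since the question also asks to keep single-text authors in the training set only, one way to sketch that (the function name `split_with_singletons` and the demo frame are hypothetical; it assumes a pandas DataFrame with a Label column, as in the question) is to set the singleton classes aside, stratified-split the rest 70/15/15, and then add the singletons back to the training set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_with_singletons(df, label_col="Label", seed=42):
    """Classes with a single sample go to train only; the rest is
    split 70/15/15 with stratification."""
    counts = df[label_col].value_counts()
    singles = df[df[label_col].isin(counts[counts == 1].index)]
    rest = df[df[label_col].isin(counts[counts > 1].index)]
    # first split off 30%, then halve it into validation and test
    train, temp = train_test_split(
        rest, test_size=0.30, stratify=rest[label_col], random_state=seed)
    val, test = train_test_split(
        temp, test_size=0.50, stratify=temp[label_col], random_state=seed)
    train = pd.concat([train, singles])
    return train, val, test

# hypothetical demo frame: two balanced classes plus one singleton author
df = pd.DataFrame({
    "Preprocessed_Text": [f"text {i}" for i in range(41)],
    "Label": ["a"] * 20 + ["b"] * 20 + ["solo"],
})
train, val, test = split_with_singletons(df)
print(len(train), len(val), len(test))  # 29 6 6
```

The second stratified split requires every remaining class to have at least two samples in the 30% slice, so very small classes (2-3 texts) may still need special handling.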

I should also add that if the dataset is rather small, say no more than 100 instances, it is preferable to use cross-validation instead of a single train_test_split, and more specifically StratifiedKFold, which returns stratified folds.

When it comes to evaluation, you should consider metrics such as precision, recall and F1-score (the harmonic mean of precision and recall), using the weighted average for each of them, which weights each class by its number of true instances. As per the documentation:

'weighted':

Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall.
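For example, on a small made-up prediction vector with three classes of unequal support:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# toy labels: class 0 has support 4, class 1 has 2, class 2 has 1
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 0, 1, 1, 2]

# average="weighted" averages the per-class scores,
# weighting each class by its number of true instances
p = precision_score(y_true, y_pred, average="weighted")
r = recall_score(y_true, y_pred, average="weighted")
f = f1_score(y_true, y_pred, average="weighted")
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Because the majority class dominates the weights, a classifier that only does well on frequent classes is penalized less here than under average="macro", which treats all classes equally; for severe imbalance it is worth reporting both.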
