
How to correctly split an unbalanced dataset into train, test and cross-validation sets

(Image: the portion of the text I'm trying to replicate in terms of sampling.)

The picture above is what I'm trying to replicate. I just don't know if I'm going about it the right way. I'm working with the FakeNewsChallenge dataset, which is extremely unbalanced, and I'm trying to replicate and improve on a method used in a paper.


Agree - 7.36%

Disagree - 1.68%

Discuss - 17.82%

Unrelated - 73.13%

I'm splitting the data in this way:

(split the dataset 67/33)

  • train 67%, test 33%

(split the training portion further 80/20 for validation)

  • training 80%, validation 20%

(then split training and validation using 3-fold cross-validation)
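For reference, the fractions of the full dataset implied by the scheme above can be worked out with a few lines of arithmetic. This is only a sketch: the helper name `split_fractions` and its parameter names are made up for illustration, and it assumes the 3-fold cross-validation is applied to the 80% training portion only.

```python
def split_fractions(test_size=0.33, val_size=0.20, n_folds=3):
    """Return the fraction of the full dataset that lands in each subset."""
    train_val = 1.0 - test_size   # portion kept after the first split
    val = train_val * val_size    # validation share of the full dataset
    train = train_val - val       # what remains for training / CV
    return {
        "test": test_size,
        "validation": val,
        "train": train,
        # In k-fold CV, each fold fits on (k-1)/k of the training portion
        # and holds out the remaining 1/k.
        "cv_fit_per_fold": train * (n_folds - 1) / n_folds,
        "cv_held_out_per_fold": train / n_folds,
    }


fractions = split_fractions()
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

So roughly 53.6% of the data ends up in the training portion and 13.4% in the validation set. With a class as rare as Disagree (1.68%), each held-out fold contains very few examples of it unless every split is stratified.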

As an aside, getting that 1.68% of disagree and agree has been extremely difficult.


This is where I'm having an issue, as it's not making total sense to me. Is the validation set created in the 80/20 split being stratified as well in the 5-fold?

Here is where I am currently:

Split data into 67% training set and 33% test set

from sklearn.model_selection import StratifiedKFold, train_test_split

x_train1, x_test, y_train1, y_test = train_test_split(x, y, test_size=0.33)

# Split the training portion 80/20 into training and validation sets
x_train2, x_val, y_train2, y_val = train_test_split(x_train1, y_train1, test_size=0.20)

skf = StratifiedKFold(n_splits=3, shuffle=True)
skf.get_n_splits(x_train2, y_train2)

# Each iteration yields index arrays for one fold of the training portion
for train_index, test_index in skf.split(x_train2, y_train2):
    x_train_cros, x_test_cros = x_train2[train_index], x_train2[test_index]
    y_train_cros, y_test_cros = y_train2[train_index], y_train2[test_index]

Would I run skf again for the validation set as well? Where are the test sets created by skf used in the sequential model?


Citation for the method I'm using:

Thota, Aswini; Tilak, Priyanka; Ahluwalia, Simrat; and Lohia, Nibrat (2018) "Fake News Detection: A Deep Learning Approach," SMU Data Science Review: Vol. 1: No. 3, Article 10. Available at: https://scholar.smu.edu/datasciencereview/vol1/iss3/10

You need to add one more parameter to the function 'train_test_split()':

x_train1, x_test, y_train1, y_test = train_test_split(x, y, test_size=0.33, stratify = y)

This keeps the proportion of each target category in the training and test sets the same as in the full dataset.
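As a rough sketch of how this could look across the whole pipeline, the example below stratifies both `train_test_split` calls and then runs 3-fold `StratifiedKFold` on the training portion. Several things here are assumptions rather than part of the question: the data is synthetic (10,000 labels built to match the quoted class shares, with Unrelated rounded up so the counts sum exactly), the features are random noise, the `random_state` values are arbitrary, `class_shares` is a hypothetical helper, and `DummyClassifier` is only a placeholder for the sequential model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in data (assumption): class counts chosen to mirror the
# shares quoted in the question; features are random noise.
classes = np.array(["agree", "disagree", "discuss", "unrelated"])
y = np.repeat(classes, [736, 168, 1782, 7314])
x = np.random.default_rng(0).normal(size=(len(y), 5))


def class_shares(labels):
    """Return {class: fraction} for an array of labels."""
    values, counts = np.unique(labels, return_counts=True)
    return {str(v): float(c) / len(labels) for v, c in zip(values, counts)}


# Stage 1: hold out 33% as the test set, stratified on the labels.
x_train1, x_test, y_train1, y_test = train_test_split(
    x, y, test_size=0.33, stratify=y, random_state=0)

# Stage 2: carve the validation set out of the remaining 67%,
# stratified on the labels of that remaining portion.
x_train2, x_val, y_train2, y_val = train_test_split(
    x_train1, y_train1, test_size=0.20, stratify=y_train1, random_state=0)

for name, labels in [("full", y), ("train", y_train2),
                     ("val", y_val), ("test", y_test)]:
    print(name, {k: round(v, 4) for k, v in class_shares(labels).items()})

# Stage 3: 3-fold stratified CV on the training portion only. In each
# iteration the model is fit on two folds and scored on the held-out fold.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_scores = []
for fit_idx, held_out_idx in skf.split(x_train2, y_train2):
    model = DummyClassifier(strategy="most_frequent")  # placeholder model
    model.fit(x_train2[fit_idx], y_train2[fit_idx])
    fold_scores.append(
        model.score(x_train2[held_out_idx], y_train2[held_out_idx]))

print("accuracy per fold:", fold_scores)
```

In this arrangement the held-out fold plays the role of the evaluation data inside each cross-validation iteration, so there is no need to run `skf` a second time on `x_val`. The separate validation set and the test set stay untouched during cross-validation and can be used afterwards, for example for tuning and for a final evaluation respectively.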
