How to correctly split an imbalanced dataset into train, test, and cross-validation sets

Question

The picture above is what I'm trying to replicate. I just don't know if I'm going about it the right way. I'm working with the FakeNewsChallenge dataset, which is extremely unbalanced, and I'm trying to replicate and improve on a method used in a paper.

Agree - 7.36%

Disagree - 1.68%

Discuss - 17.82%

Unrelated - 73.13%

I'm splitting the data in this way:

(split the dataset 67/33)

train 67%, test 33%

(split the training set further 80/20 for validation)

training 80%, validation 20%

(then split training and validation using 3-fold cross-validation)

As an aside, getting that 1.68% of disagree and agree has been extremely difficult.

This is where I'm having an issue, as it's not making total sense to me. Is the validation set created in the 80/20 split being stratified as well in the 5-fold?

Here is where I am currently:

Split the data into a 67% training set and a 33% test set:

from sklearn.model_selection import StratifiedKFold, train_test_split

x_train1, x_test, y_train1, y_test = train_test_split(x, y, test_size=0.33)

# Split the training set further into 80% training and 20% validation
x_train2, x_val, y_train2, y_val = train_test_split(x_train1, y_train1, test_size=0.20)

# 3-fold stratified cross-validation on the remaining training data
skf = StratifiedKFold(n_splits=3, shuffle=True)
skf.get_n_splits(x_train2, y_train2)

for train_index, test_index in skf.split(x_train2, y_train2):
    x_train_cros, x_test_cros = x_train2[train_index], x_train2[test_index]
    y_train_cros, y_test_cros = y_train2[train_index], y_train2[test_index]

Would I run skf again for the validation set as well? Where are the test sets created by skf used in the Sequential model?

Citation for the method I'm using:

Thota, Aswini; Tilak, Priyanka; Ahluwalia, Simrat; and Lohia, Nibrat (2018) "Fake News Detection: A Deep Learning Approach," SMU Data Science Review: Vol. 1: No. 3, Article 10. Available at: https://scholar.smu.edu/datasciencereview/vol1/iss3/10

Answer 1

You need to pass one more parameter to train_test_split():

x_train1, x_test, y_train1, y_test = train_test_split(x, y, test_size=0.33, stratify = y)

This keeps the class proportions of y (approximately) the same in the training and test sets, so every target category is represented in both.
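To see how this could carry through the whole 67/33, 80/20, 3-fold pipeline from the question, here is a minimal sketch. It is not the paper's exact procedure. The data is a synthetic stand-in (an assumption): 10,000 labels with roughly the quoted class shares (unrelated is set to 73.14% so the counts sum to 10,000), placeholder features, and a hypothetical helper `proportions()`. The `random_state` values are also just for repeatability.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the real data (assumption, not FakeNewsChallenge itself).
counts = {"agree": 736, "disagree": 168, "discuss": 1782, "unrelated": 7314}
y = np.repeat(list(counts.keys()), list(counts.values()))
x = np.arange(len(y)).reshape(-1, 1)  # placeholder features


def proportions(labels):
    """Hypothetical helper: share of each class in `labels`, rounded to 4 places."""
    values, n = np.unique(labels, return_counts=True)
    return dict(zip(values, np.round(n / n.sum(), 4)))


# 67/33 train/test split, stratified on the labels.
x_train1, x_test, y_train1, y_test = train_test_split(
    x, y, test_size=0.33, stratify=y, random_state=42
)

# 80/20 train/validation split, stratified on the remaining training labels.
x_train2, x_val, y_train2, y_val = train_test_split(
    x_train1, y_train1, test_size=0.20, stratify=y_train1, random_state=42
)

# 3-fold stratified cross-validation on what is left of the training data.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
fold_props = []
for train_index, holdout_index in skf.split(x_train2, y_train2):
    x_fold_train, x_fold_holdout = x_train2[train_index], x_train2[holdout_index]
    y_fold_train, y_fold_holdout = y_train2[train_index], y_train2[holdout_index]
    # A model would be fit on the fold's training part and scored on its held-out part.
    fold_props.append(proportions(y_fold_holdout))

for name, labels in [("all", y), ("train", y_train2), ("val", y_val), ("test", y_test)]:
    print(name, proportions(labels))
```

In this arrangement the held-out part of each fold plays the role of that fold's validation data, so `skf` is not run again on `x_val`. The 80/20 validation set and the test set stay outside the cross-validation loop. Whether the paper uses both a fixed validation set and cross-validation is something to check against its description.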

Solution 1: 2, accepted, 2020-08-16 06:14:18