
Is it necessary to split data into three; train, val and test?

Here the difference between the test, train, and validation sets is described. In most documentation on training neural networks, I find that these three sets are used; however, they are often predefined.

I have a relatively small data set (906 3D images in total, and the class distribution is balanced). I'm using the sklearn.model_selection.train_test_split function to split the data into train and test sets, and I use X_test and y_test as validation data in my model.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
...
history = AD_model.fit(
    X_train, 
    y_train, 
    batch_size=batch_size,
    epochs=100,
    verbose=1,
    validation_data=(X_test, y_test))

After training, I evaluate the model on the test set:

test_loss, test_acc = AD_model.evaluate(X_test,  y_test, verbose=2)

I've seen other people approach it this way too, but since the model has already seen this data, I'm not sure what the consequences of this approach are. Can someone tell me what the consequences are of using the same set for validation and testing? And since I already have a small data set (which results in overfitting), is it necessary to split the data into 3 sets?

You can use train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))]); it produces a 60%, 20%, 20% split for the training, validation, and test sets.
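As a runnable sketch of the one-liner above; the DataFrame here is a small hypothetical stand-in for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame standing in for the real data set
df = pd.DataFrame({"feature": range(100), "label": [0, 1] * 50})

# Shuffle the rows, then cut at the 60% and 80% marks -> 60/20/20 split
train, validate, test = np.split(
    df.sample(frac=1, random_state=1),
    [int(0.6 * len(df)), int(0.8 * len(df))],
)
```

Note that np.split cuts at absolute row positions, so the two cut points together define all three pieces; np.split returns DataFrame slices, so train, validate, and test are themselves DataFrames.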

Hope it's helpful. Thank you for reading!

A validation set can be used for the following:

  • Monitor the performance of your ongoing training model on data that was not part of the training set. This helps you verify that your model is training correctly and not over-fitting.
  • Select the hyper-parameters that give you the best performance.
  • Select the best snapshot/weights, or the stopping epoch, based on the validation metrics.
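The last point (keeping the best snapshot and deciding the stopping epoch) can be sketched as a plain loop; the per-epoch validation losses below are hypothetical numbers standing in for metrics computed on a real validation set:

```python
# Hypothetical per-epoch validation losses
# (in practice these would be computed on the validation set each epoch)
val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.60]

best_epoch, best_loss = 0, float("inf")
patience, wait = 2, 0  # stop after 2 epochs without improvement

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        # New best validation loss: remember it (snapshot weights here)
        best_loss, best_epoch, wait = loss, epoch, 0
    else:
        wait += 1
        if wait >= patience:  # early stopping
            break
```

In Keras, the same behavior comes ready-made from the EarlyStopping and ModelCheckpoint(save_best_only=True) callbacks monitoring val_loss.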

Using the same set for both validation and testing will prevent you from comparing your model to any other method on the same data in an unbiased way, since the model's hyper-parameters (and stopping criteria) were selected to maximize performance on this set. It will also make your results somewhat optimistic, since a validation set (on which the model was selected) is probably easier than an unseen test set.

This is what I do:

  1. Split the data into 80% train and 20% test sets.
  2. With the train data, do 5-fold cross-validation. Note that the train set will be split again 80%-20% in each fold because of the cross-validation, but CV modules do this themselves (e.g. sklearn's cross-validation), so you don't have to split it again manually.
  3. After each fold, evaluate the model using the test set.
  4. After 5 folds, decide the model's accuracy from the mean and standard deviation of the CV scores. You can also add classification reports, confusion matrices, loss and accuracy plots, etc.
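The steps above can be sketched with sklearn. The data here is a hypothetical stand-in generated with make_classification, and LogisticRegression is just a placeholder model; substitute your own X, Y, and estimator:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical stand-in data; replace with the real X, Y
X, Y = make_classification(n_samples=906, random_state=1)

# Step 1: hold out 20% as a final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=1)

# Step 2: 5-fold cross-validation on the training portion only
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5)

# Step 4: mean and std of the CV scores summarize the model's accuracy
print(scores.mean(), scores.std())

# Final check on the untouched test set
model.fit(X_train, y_train)
test_acc = model.score(X_test, y_test)
```

Because the test split never enters cross_val_score, the test accuracy stays an unbiased estimate of performance on unseen data.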

The reason for using train, validation, and test sets is that, within one fold, the model trains itself on the train data and optimizes itself on the validation data, and at the end of training I test the model on the test data.

This is why using a completely separate test set is a good way to decide whether the model's accuracy is satisfactory: the model optimizes itself using the error from evaluating the validation data. If you then evaluate it on the validation data again, it's not fair, because the model has, in a sense, seen it before.

For your situation, if you can manage to split the data evenly (the same amount of each class in the test set), then yes, it's still good to split it into 3 sets.
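A minimal sketch of an even (class-balanced) three-way split using two stratified calls to train_test_split; the generated data is a hypothetical stand-in for the 906 balanced images:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 906 balanced samples
X, Y = make_classification(n_samples=906, weights=[0.5, 0.5], random_state=1)

# First cut: 80% train+val vs. 20% test, stratified so class ratios match
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=1)

# Second cut: 0.25 of the remaining 80% = 20% of the total, for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=1)
```

The stratify argument keeps the per-class proportions (roughly) identical across all three sets, which matters most for small data sets like this one.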
