Splitting data into training, testing, and evaluation sets when making a Keras model

Question

I'm a little confused about splitting the dataset when I'm making and evaluating Keras machine learning models. Let's say that I have a dataset of 1000 rows.

features = df.iloc[:,:-1]
results = df.iloc[:,-1]

Now I want to split this data into training and testing sets (33% of the data for testing, 67% for training):

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.33)

I have read on the internet that fitting the data into model should look like this:

history = model.fit(features, results, validation_split = 0.2, epochs = 10, batch_size=50)

So I'm fitting the full data (features and results) to my model, and from that data I'm using 20% for validation: validation_split = 0.2. So basically, my model will be trained with 80% of the data and tested on 20% of the data.

So confusion starts when I need to evaluate the model:

score = model.evaluate(x_test, y_test, batch_size=50)

Is this correct? I mean, why should I split the data into training and testing, and where do x_train and y_train go?

Can you please explain to me what the correct order of steps for creating a model is?

Answer 1

Generally, at training time (model.fit) you have two sets: a training set and a validation set (also called a tuning or development set). You train the model on the training set and use the validation set to find the best hyper-parameters. When you're done, you can test your model on an unseen data set, one that was completely hidden from the model, unlike the training and validation sets.
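As a rough illustration for the 1000-row dataset in the question, the sketch below works out how many rows each of the three sets would get if 33% is held out for testing and 20% of the remainder is used for validation. The helper name split_sizes is made up for this example, and the exact rounding used by scikit-learn or Keras may differ by a row in edge cases.

```python
def split_sizes(n_rows, test_pct=33, val_pct=20):
    """Return (train, validation, test) row counts for a three-way split.

    Percentages are integers so the arithmetic stays exact.
    """
    n_test = n_rows * test_pct // 100      # rows hidden until the final evaluation
    n_remaining = n_rows - n_test          # rows that are passed to model.fit
    n_val = n_remaining * val_pct // 100   # rows used to tune hyper-parameters
    n_train = n_remaining - n_val          # rows the weights are actually fitted on
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(1000)
print(n_train, n_val, n_test)  # prints: 536 134 330
```

So with both splits in place, the weights are fitted on roughly half of the original rows, which is worth keeping in mind for small datasets.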

Now, when you used

X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)

By this, you split features and results into 33% of the data for testing and 67% for training. Now you can do one of two things:

1. Use X_test and y_test as the validation set in model.fit(...). Or,
2. Use them for the final prediction in model.predict(...).

So, if you choose to use the test set as the validation set (number 1), you would do as follows:

model.fit(x=X_train, y=y_train,
          validation_data=(X_test, y_test), ...)

In the training log, you will get the validation results along with the training score. The validation results should be the same if you later compute model.evaluate(X_test, y_test).

Now, if you choose to keep the test set for the final prediction or final evaluation (number 2), then you need to create a new validation set or use the validation_split argument as follows:

model.fit(x=X_train, y=y_train,
          validation_split=0.2, ...)

Keras will take 20% of the training data (X_train and y_train) and use it for validation. Lastly, for the final evaluation of your model, you can do as follows:
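The Keras documentation states that validation_split holds out the last samples of the arrays passed to fit, before any shuffling. A minimal emulation in plain Python shows why data that is sorted by label should be shuffled first. The function tail_validation_split is hypothetical and only mimics the slicing; it is not a Keras API.

```python
import random


def tail_validation_split(rows, validation_split=0.2):
    """Emulate holding out the final fraction of `rows` for validation."""
    split_at = int(len(rows) * (1 - validation_split))
    return rows[:split_at], rows[split_at:]


# Ten labels sorted by class: the held-out tail contains only class 1.
labels = [0] * 5 + [1] * 5
train_part, val_part = tail_validation_split(labels)
print(val_part)  # prints: [1, 1]

# Shuffling first (seeded here for repeatability) mixes the classes
# before the tail is cut off.
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
train_shuffled, val_shuffled = tail_validation_split(shuffled)
```

Since train_test_split shuffles by default, the X_train and y_train arrays from the split above are already in random order when they reach fit.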

y_pred = model.predict(X_test, batch_size=50)

Now you can compare y_pred with y_test using some relevant metrics.
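Putting option 2 together, the whole order of steps might look like the sketch below. It assumes TensorFlow's bundled Keras and scikit-learn are installed. The synthetic data, the four feature columns, the tiny network, and the binary cross-entropy loss are all placeholders chosen for illustration and are not taken from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Synthetic stand-in for the question's 1000-row DataFrame (assumption).
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 4)).astype("float32")
results = (features.sum(axis=1) > 0).astype("float32")  # made-up binary label

# 1. Hold out 33% as a test set that is never used during training.
X_train, X_test, y_train, y_test = train_test_split(
    features, results, test_size=0.33, random_state=0
)

# 2. Build and compile a small binary classifier (placeholder architecture).
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# 3. Fit on the training rows only; Keras holds out the last 20% for validation.
history = model.fit(
    X_train, y_train, validation_split=0.2, epochs=10, batch_size=50, verbose=0
)

# 4. Final evaluation on the untouched test set.
test_loss, test_acc = model.evaluate(X_test, y_test, batch_size=50, verbose=0)

# 5. Optionally, get raw predictions to compare against y_test.
y_pred = model.predict(X_test, batch_size=50, verbose=0)
print(X_train.shape, X_test.shape, y_pred.shape)
```

The per-epoch training and validation metrics end up in history.history, while test_loss and test_acc come only from rows the model never saw in fit.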

Answer 2

Generally, you'd want to pass the X_train and y_train data that you have split off as the arguments to the fit method. So it would look something like:

history = model.fit(X_train, y_train, batch_size=50)

Passing the unsplit data to the fit method and adding the validation_split argument works as well, but be careful to check the Keras documentation on the validation_data and validation_split arguments to make sure the data is being split as you expect.

There is a related question here: https://datascience.stackexchange.com/questions/38955/how-does-the-validation-split-parameter-of-keras-fit-function-work

Keras documentation: https://keras.rstudio.com/reference/fit.html

Answer 3

I have read on the internet that fitting the data into model should look like this:

That means you need to fit features and labels. You already split them into x_train and y_train, so your fit should look like this:

history = model.fit(x_train, y_train, validation_split = 0.2, epochs = 10, batch_size=50)

So confusion starts when I need to evaluate the model:

score = model.evaluate(x_test, y_test, batch_size=50) --> Is this correct?

That's correct: you evaluate the model using the test features and their corresponding labels. Furthermore, if you only want, for example, the predicted labels, you can use:

y_hat = model.predict(x_test)

Then you can compare y_hat with y_test, e.g. to get a confusion matrix.
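For a binary classifier with a sigmoid output, one way to make that comparison is to threshold the predicted probabilities and count the four possible outcomes. The sketch below uses made-up values for y_test and y_hat and a 0.5 threshold, all of which are assumptions; in practice a library routine such as sklearn.metrics.confusion_matrix does the same counting.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count (TN, FP, FN, TP) for binary labels and predicted probabilities."""
    tn = fp = fn = tp = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1  # positive correctly predicted
        elif truth == 1:
            fn += 1  # positive missed
        elif predicted == 1:
            fp += 1  # negative wrongly flagged as positive
        else:
            tn += 1  # negative correctly predicted
    return tn, fp, fn, tp


# Made-up labels and predicted probabilities, for illustration only.
y_test = [1, 0, 1, 1, 0, 0]
y_hat = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]

tn, fp, fn, tp = confusion_counts(y_test, y_hat)
accuracy = (tp + tn) / len(y_test)
print(tn, fp, fn, tp)  # prints: 2 1 1 2
```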

Answer 1: score 3, accepted, 2021-04-19 12:55:00
Answer 2: score 1, 2021-04-19 12:32:10
Answer 3: score 1, 2021-04-19 12:32:12