
Splitting data into training, testing and validation sets when making a Keras model

I'm a little confused about splitting the dataset when building and evaluating Keras machine learning models. Let's say I have a dataset of 1000 rows.

features = df.iloc[:,:-1]
results = df.iloc[:,-1]

Now I want to split this data into training and testing (33% of data for testing, 67% for training):

X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)

I have read on the internet that fitting the data to the model should look like this:

history = model.fit(features, results, validation_split = 0.2, epochs = 10, batch_size=50)

So I'm fitting the full data (features and results) to my model, and from that data I'm using 20% for validation: validation_split = 0.2. So basically, my model will be trained on 80% of the data and validated on the other 20%.

So confusion starts when I need to evaluate the model:

score = model.evaluate(X_test, y_test, batch_size=50)

Is this correct? I mean, why should I split the data into training and testing at all, and where do X_train and y_train go?

Can you please explain to me what the correct order of steps is for creating a model?

Generally, at training time (model.fit), you have two sets: one is the training set and the other is the validation/tuning/development set. With the training set you train the model, and with the validation set you find the best set of hyper-parameters. When you're done, you can then test your model on unseen data - a set that was completely hidden from the model, unlike the training and validation sets.
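The three-set scheme described above can be sketched without any libraries. The thread itself uses sklearn's train_test_split; three_way_split below is a hypothetical helper, shown only to make the index bookkeeping explicit (shuffle once, carve off the test set, then carve a validation set out of what remains):

```python
import random

def three_way_split(features, labels, test_frac=0.33, val_frac=0.2, seed=0):
    """Shuffle indices, hold out `test_frac` of the rows as a test set,
    then carve `val_frac` of the remaining rows out as a validation set."""
    idx = list(range(len(features)))
    random.Random(seed).shuffle(idx)

    n_test = int(len(idx) * test_frac)
    test_idx, rest = idx[:n_test], idx[n_test:]

    n_val = int(len(rest) * val_frac)
    val_idx, train_idx = rest[:n_val], rest[n_val:]

    pick = lambda data, ids: [data[i] for i in ids]
    return (pick(features, train_idx), pick(labels, train_idx),
            pick(features, val_idx),   pick(labels, val_idx),
            pick(features, test_idx),  pick(labels, test_idx))

# With 1000 rows: 330 go to test, 134 to validation, 536 to training
X = [[i] for i in range(1000)]
y = [i % 2 for i in range(1000)]
X_tr, y_tr, X_val, y_val, X_te, y_te = three_way_split(X, y)
print(len(X_tr), len(X_val), len(X_te))  # 536 134 330
```

The model never sees X_val during weight updates and never sees X_te until the very end.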


Now, when you used

X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)

This splits features and results into 33% of the data for testing and 67% for training. Now you can do one of two things:

  1. use X_test and y_test as the validation set in model.fit(...), or
  2. use them for the final evaluation in model.evaluate(...) / model.predict(...)

So, if you use these test sets as a validation set (option 1), you would do as follows:

model.fit(x=X_train, y=y_train,
          validation_data=(X_test, y_test), ...)

In the training log, you will get the validation results alongside the training scores. The validation results should match what you get if you later compute model.evaluate(X_test, y_test).


Now, if you keep those test sets for final prediction or final evaluation (option 2), then you need to create a validation set separately or use the validation_split argument as follows:

model.fit(x=X_train, y=y_train,
          validation_split=0.2, ...)

The Keras API will take 20% of the training data (X_train and y_train) and use it for validation. And lastly, for the final evaluation of your model, you can do as follows:

y_pred = model.predict(X_test, batch_size=50)

Now, you can compare y_test with y_pred using some relevant metrics.
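For a binary classifier with a sigmoid output, model.predict returns probabilities, so comparing with y_test means thresholding first. A minimal dependency-free sketch of one such metric (accuracy); the y_test/y_prob values here are made up for illustration:

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Threshold predicted probabilities into 0/1 labels and
    count how many match the true labels."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical true labels and model.predict(...) outputs
y_test = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6]
print(binary_accuracy(y_test, y_prob))  # 3 of 5 correct -> 0.6
```

In practice you would use sklearn.metrics (accuracy_score, precision, recall, etc.) rather than hand-rolling this, but the computation is the same.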

Generally, you'd want to use the X_train and y_train data that you have split as arguments to the fit method. So it would look something like:

history = model.fit(X_train, y_train, batch_size=50)

While not splitting your data beforehand and instead passing everything to the fit method with the validation_split argument works as well, just be careful to refer to the Keras documentation on the validation_data and validation_split arguments to make sure the data is being split up as you expect.
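One detail worth knowing from the Keras docs: validation_split takes the *last* fraction of the arrays you pass, before any shuffling, so an ordered dataset can end up with an unrepresentative validation set. A plain-Python sketch that mimics that slicing (split_for_validation is a hypothetical stand-in, not a Keras API):

```python
def split_for_validation(x, y, validation_split=0.2):
    """Mimic how Keras carves out validation data: the last fraction of
    the arrays, taken before any shuffling of the training portion."""
    n_val = int(len(x) * validation_split)
    split_at = len(x) - n_val
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

x = list(range(10))
y = [0] * 5 + [1] * 5          # ordered labels: second half is all 1s
(train_x, train_y), (val_x, val_y) = split_for_validation(x, y, 0.2)
print(val_y)  # [1, 1] -- the validation set only ever sees class 1 here
```

This is why shuffling your data (e.g. via train_test_split) before relying on validation_split is a good habit.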

There is a related question here: https://datascience.stackexchange.com/questions/38955/how-does-the-validation-split-parameter-of-keras-fit-function-work

Keras documentation: https://keras.rstudio.com/reference/fit.html

"I have read on the internet that fitting the data to the model should look like this:"

That means you need to fit features and labels. You have already split them into X_train and y_train, so your fit should look like this:

history = model.fit(X_train, y_train, validation_split=0.2, epochs=10, batch_size=50)

"So confusion starts when I need to evaluate the model:

score = model.evaluate(X_test, y_test, batch_size=50) --> Is this correct?"

That's correct: you evaluate the model using the testing features and their corresponding labels. Furthermore, if you want only the predicted labels, for example, you can use:

y_hat = model.predict(X_test)

Then you can compare y_hat with y_test, e.g. to get a confusion matrix.
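In practice you would reach for sklearn.metrics.confusion_matrix here; as a dependency-free sketch, this is what it computes for binary labels (assuming y_hat has already been thresholded from probabilities to 0/1 labels, as with a sigmoid output):

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels,
    matching the row/column layout used by sklearn."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1          # row = true label, column = predicted label
    return m

# Hypothetical thresholded predictions vs. true test labels
y_test = [0, 0, 1, 1, 1, 0]
y_hat  = [0, 1, 1, 0, 1, 0]
print(confusion_matrix_2x2(y_test, y_hat))  # [[2, 1], [1, 2]]
```

From these four counts you can derive accuracy, precision, recall and similar metrics.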
