
Keras: Shuffling data using model.fit() doesn't make a difference, but sklearn.train_test_split() does

I'm new to Keras and am facing a problem that I don't understand, nor have I been able to find any solution on the internet so far.

I use the following few lines to train a simple model on the UrbanSound8K dataset:

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from sklearn.model_selection import train_test_split

x_train, y_train, _, _ = load_data(["data_1.pickle", "data_5.pickle"])
#x_train, _, y_train, _ = train_test_split(x_train, y_train, test_size=0.01, random_state=0, shuffle=True)

model = Sequential()

model.add(Dense(256, input_shape=(40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(10))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')

model.fit(x_train, y_train, validation_split=0.2, batch_size=32, epochs=50, shuffle=True)

When I train this model it reaches a val_accuracy of around 50%. Changing shuffle to False in model.fit() doesn't seem to have any impact. But when I uncomment the second line and use x_train, _, y_train, _ = train_test_split(x_train, y_train, test_size=0.01, random_state=0, shuffle=True) to shuffle the dataset, the model reaches a val_accuracy of more than 80%, regardless of whether shuffle in model.fit() is set to True or False!

How is this possible? Shuffling the data before fitting the model shouldn't make any difference, since the training data is shuffled before every epoch anyway. Or do I misunderstand the shuffle parameter of model.fit()? Or is there some additional magic taking place in train_test_split()?
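For reference, a plain NumPy shuffle (a minimal sketch, assuming x_train and y_train are NumPy arrays of equal length) has the same effect for me as the train_test_split workaround:

import numpy as np

# Shuffle x_train and y_train in unison before calling model.fit(),
# so the tail of the arrays is a random sample of the whole dataset.
rng = np.random.default_rng(seed=0)
perm = rng.permutation(len(x_train))
x_train, y_train = x_train[perm], y_train[perm]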

You are using a validation split of 0.2. Per the model.fit documentation:

 The validation data is selected from the last samples in the x and y data provided, before shuffling. 

So the only thing I can think of is this: when you do not use train_test_split, the validation data used by model.fit is always the same data, taken from the end of the unshuffled training data. When you use train_test_split, the training data is shuffled beforehand, so the validation data is different in that case. If the validation set is small, this can make a dramatic difference in the computed validation accuracy, because the validation samples differ between the two cases.

I think it is poor practice for model.fit to select the validation data from the end of the training data; it should select it randomly from the training data. Even with a fairly large number of validation samples, if the data at the end of the training set has a significantly different probability distribution than the rest of the training data, this can result in a much lower validation accuracy. For example, if you are classifying dogs vs. cats and all the images at the end of the training set are of cats, then the validation images would all be cats.
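To make the behavior concrete, here is a rough sketch of how validation_split carves off the validation set for array inputs. This is a simplification for illustration, not the exact Keras library code:

def split_for_validation(x, y, validation_split=0.2):
    """Take the LAST fraction of the arrays as validation data, in the
    order they were passed in. shuffle=True in model.fit() only shuffles
    the remaining training portion afterwards, once per epoch."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# If UrbanSound8K is loaded fold by fold, the tail of x_train is
# dominated by a single fold, so the validation set is not
# representative unless the data is shuffled first.
(x_tr, y_tr), (x_val, y_val) = split_for_validation(x_train, y_train)

A more robust alternative is to shuffle the data yourself before fitting, or to pass an explicitly held-out set via the validation_data argument of model.fit() instead of relying on validation_split.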
