
Keras: How to take random samples for validation set?

I'm currently training a Keras model whose corresponding fit call looks as follows:

model.fit(X, y_train, batch_size=myBatchSize, epochs=myAmountOfEpochs, validation_split=0.1, callbacks=myCallbackList)

This comment on the Keras GitHub page explains the meaning of "validation_split=0.1":

The validation data is not necessarily taken from every class and it is just the last 10% (assuming that you ask for 10%) of the data.

My question now is: is there an easy way to randomly select, say, 10% of my training data as validation data? The reason I would like to use randomly picked samples is that the last 10% of the data don't necessarily contain all classes in my case.

Thank you very much.

Keras doesn't provide any feature more advanced than simply taking a fraction of your training data for validation. If you need something more advanced, like stratified sampling to make sure classes are well represented in the sample, then you need to do this manually outside of Keras (using, say, scikit-learn or numpy) and then pass that validation data to Keras through the validation_data parameter in model.fit.
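For instance, a minimal sketch of such a manual random (non-stratified) split using numpy; X, y, and model are placeholder names, not part of the original answer:

import numpy as np

# Hold out a random 10% of the samples as validation data.
# X and y are placeholders for the full training arrays.
indices = np.random.permutation(len(X))
n_val = int(0.1 * len(X))
val_idx, train_idx = indices[:n_val], indices[n_val:]

model.fit(X[train_idx], y[train_idx],
          validation_data=(X[val_idx], y[val_idx]))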

Thanks to the comments of Matias Valdenegro, I was inspired to look a bit further and came up with the following solution to my problem:

from sklearn.model_selection import train_test_split

# [input: X and Y]
# Stratified split: every class is represented proportionally in both sets.
XTraining, XValidation, YTraining, YValidation = train_test_split(X, Y, stratify=Y, test_size=0.1)  # before model building

# [The model is built here...]
model.fit(XTraining, YTraining, batch_size=batchSize, epochs=amountOfEpochs, validation_data=(XValidation, YValidation), callbacks=callbackList)
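As a quick sanity check (not part of the original answer), you can verify that the stratified split preserved the class proportions, assuming Y holds integer class labels (for one-hot labels, compare Y.argmax(axis=1) instead):

import numpy as np

# Class frequencies should be nearly identical in the full set and the validation set.
for name, labels in (('full set', Y), ('validation set', YValidation)):
    classes, counts = np.unique(labels, return_counts=True)
    print(name, dict(zip(classes, counts / counts.sum())))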

In this post I have suggested a solution which uses the split-folders package to randomly split your main data directory into training and validation directories while maintaining the class sub-folders. You can then use the Keras .flow_from_directory method to specify your train and validation paths.

Splitting your folders from the docs:

import split_folders

# Split with a ratio.
# To only split into training and validation set, pass a 2-tuple to `ratio`, e.g. `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values

# Split val/test with a fixed number of items, e.g. 100 for each set.
# To only split into training and validation set, pass a single number to `fixed`, e.g. `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values

The input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

In order to give you this:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/            # optional
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...
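If you want to try the split locally first, here is a small sketch (not from the original answer) that creates a dummy input tree in the format shown above; the folder and file names are hypothetical, and empty placeholder files should be enough since split_folders copies files without decoding them:

import os

# Build a tiny dummy tree matching the expected input format.
for cls in ('class1', 'class2'):
    os.makedirs(os.path.join('input', cls), exist_ok=True)
    for i in range(10):
        open(os.path.join('input', cls, f'img{i}.jpg'), 'w').close()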

Using the Keras ImageDataGenerator to build your training and validation datasets:

import tensorflow as tf
import split_folders
import os

main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'

split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))

# Rescale pixel values from [0, 255] to [0, 1].
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255)

train_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'train'),
                                                    class_mode='categorical',
                                                    batch_size=32,
                                                    target_size=(224,224),
                                                    shuffle=True)

validation_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'val'),
                                                        target_size=(224, 224),
                                                        batch_size=32,
                                                        class_mode='categorical',
                                                        shuffle=True) # set as validation data

IMG_SHAPE = (224, 224, 3)  # matches the generators' target_size, plus 3 colour channels

base_model = tf.keras.applications.ResNet50V2(
    input_shape=IMG_SHAPE,
    include_top=False,
    weights=None)

maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
prediction_layer = tf.keras.layers.Dense(4, activation='softmax')

model = tf.keras.Sequential([
    base_model,
    maxpool_layer,
    prediction_layer
])

opt = tf.keras.optimizers.Adam(learning_rate=0.004)
model.compile(optimizer=opt,
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

model.fit(
    train_generator,
    steps_per_epoch = train_generator.samples // 32,
    validation_data = validation_generator,
    validation_steps = validation_generator.samples // 32,
    epochs = 20)

According to the Keras getting-started FAQ, you can use the shuffle argument in model.fit.

Among the model.fit() arguments, validation_data overrides validation_split, so there is no need to configure both at the same time.

validation_split: Float between 0 and 1.
            Fraction of the training data to be used as validation data.
            The model will set apart this fraction of the training data,
            will not train on it, and will evaluate
            the loss and any model metrics
            on this data at the end of each epoch.

validation_data: Data on which to evaluate
            the loss and any model metrics at the end of each epoch.
            The model will not be trained on this data.
            `validation_data` will override `validation_split`

But there is one option that may fulfill your purpose: the shuffle argument.

shuffle: Boolean (whether to shuffle the training data
            before each epoch) or str (for 'batch').
            'batch' is a special option for dealing with the
            limitations of HDF5 data; it shuffles in batch-sized chunks.

So what you could do is:

model.fit(**other_kwargs, validation_split=0.1, shuffle=True)
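One caveat from the Keras documentation: the validation data is selected from the last samples in x and y before shuffling, so shuffle=True on its own does not randomize which samples land in the validation set. A minimal sketch that shuffles the arrays up front (reusing the X and y_train names from the question):

import numpy as np

# validation_split always takes the *tail* of the arrays,
# so shuffle them once before fitting to make that tail random.
perm = np.random.permutation(len(X))
X, y_train = X[perm], y_train[perm]
model.fit(X, y_train, validation_split=0.1, shuffle=True)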

This is too long for a comment, so I post it here.

If you have 1000 training samples, 100 testing samples, validation_split=0.1 and batch_size=100, what it would do is split the training data (batch 1: 90 training and 10 validation, batch 2: 90 training and 10 validation, ..., all in the original order: 90, 10, 90, 10, ..., 90, 10), and this has nothing to do with the 100 testing samples (they would never be seen by your model). So I guess you only want to shuffle the size-10 validation sets without touching the size-90 training sets. What I might do is manually shuffle that 10% part of my data, because that is what shuffle=True does: it just shuffles the indices and replaces the old training data with the shuffle-indexed version, like this:

import numpy as np

train_index = np.arange(1000, dtype=np.int32)
split = 0.1
batch_size = 100
num_batch = len(train_index) // batch_size
n_train = int((1 - split) * batch_size)   # 90 training indices per batch
n_val = batch_size - n_train              # 10 validation indices per batch

# View the flat index array as one row per batch.
train_index = np.reshape(train_index, (num_batch, batch_size))
for i in range(num_batch):
    # Shuffle only the last n_val (validation) indices of this batch.
    r = np.random.choice(n_val, n_val, replace=False)
    print(r)
    train_index[i, n_train:] = r + n_train + i * batch_size
    print(train_index[i])

flatten_index = train_index.reshape(-1)
print(flatten_index)

x_train = np.arange(1000, 2000)
x_train = x_train[flatten_index]
print(x_train)
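A quick check (not in the original post) that flatten_index is still a permutation of 0..999, i.e. no sample was dropped or duplicated:

# Every original index should appear exactly once after shuffling.
assert np.array_equal(np.sort(flatten_index), np.arange(1000))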
