Keras: How to take random samples for validation set?

I'm currently training a Keras model whose corresponding fit call looks as follows:

model.fit(X, y_train, batch_size=myBatchSize, epochs=myAmountOfEpochs, validation_split=0.1, callbacks=myCallbackList)

This comment on the Keras GitHub page explains the meaning of validation_split=0.1:

The validation data is not necessarily taken from every class and it is just the last 10% (assuming that you ask for 10%) of the data.

My question is now: is there an easy way to randomly select, say, 10% of my training data as validation data? The reason I would like to use randomly picked samples is that the last 10% of the data don't necessarily contain all classes in my case.

Thank you very much.

Keras doesn't provide any feature more advanced than simply taking a fraction of your training data for validation. If you need something more advanced, like stratified sampling to make sure classes are well represented in the sample, then you need to do this manually outside of Keras (using, say, scikit-learn or numpy) and then pass that validation data to Keras through the validation_data parameter in model.fit.

Thanks to the comments of Matias Valdenegro, I was inspired to look a bit further and came up with the following solution to my problem:

from sklearn.model_selection import train_test_split

# X and Y hold the full training inputs and labels
XTraining, XValidation, YTraining, YValidation = train_test_split(X, Y, stratify=Y, test_size=0.1)  # before model building

# [The model is built here...]
model.fit(XTraining, YTraining, batch_size=batchSize, epochs=amountOfEpochs, validation_data=(XValidation, YValidation), callbacks=callbackList)
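As a quick sanity check (my addition, not part of the original answer), you can verify that the stratified split preserved the class proportions, assuming Y holds integer class labels (one-hot labels would need an argmax first):

import numpy as np

# Class counts per split; the ratios should match because of stratify=Y
print(np.unique(YTraining, return_counts=True))
print(np.unique(YValidation, return_counts=True))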

In this post I have suggested a solution which uses the split-folders package to randomly split your main data directory into training and validation directories while maintaining the class sub-folders. You can then use the Keras .flow_from_directory method to specify your train and validation paths.

Splitting your folders, from the docs:

import split_folders

# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`, i.e., `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values

# Split val/test with a fixed number of items e.g. 100 for each set.
# To only split into training and validation set, use a single number to `fixed`, i.e., `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values
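Note (added for context, not part of the original answer): the package is installed with pip install split-folders. In more recent releases the import name has changed to splitfolders (i.e. import splitfolders; splitfolders.ratio(...)), so adjust the snippets accordingly if the split_folders import fails.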

The input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

In order to give you this:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/            # optional
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...

Using the Keras ImageDataGenerator to build your training and validation datasets:

import tensorflow as tf
import split_folders
import os

main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'

split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255)  # scale 8-bit pixel values to [0, 1]

train_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'train'),
                                                    class_mode='categorical',
                                                    batch_size=32,
                                                    target_size=(224,224),
                                                    shuffle=True)

validation_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'val'),
                                                        target_size=(224, 224),
                                                        batch_size=32,
                                                        class_mode='categorical',
                                                        shuffle=True) # set as validation data

IMG_SHAPE = (224, 224, 3)  # matches the generators' target_size plus RGB channels

base_model = tf.keras.applications.ResNet50V2(
    input_shape=IMG_SHAPE,
    include_top=False,
    weights=None)

maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
prediction_layer = tf.keras.layers.Dense(4, activation='softmax')

model = tf.keras.Sequential([
    base_model,
    maxpool_layer,
    prediction_layer
])

opt = tf.keras.optimizers.Adam(learning_rate=0.004)
model.compile(optimizer=opt,
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

model.fit(
    train_generator,
    steps_per_epoch = train_generator.samples // 32,
    validation_data = validation_generator,
    validation_steps = validation_generator.samples // 32,
    epochs = 20)
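As an aside (my addition, not part of the original answer): if you would rather not duplicate files on disk, ImageDataGenerator itself accepts a validation_split argument, combined with subset= in flow_from_directory. Note that this split is deterministic (it slices the file list within each class folder) rather than random, which is exactly what the question is trying to avoid, so the folder-splitting approach above remains the way to get a random split:

datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    validation_split=0.3)  # reserve 30% of the files for validation

train_generator = datagen.flow_from_directory(main_dir,
                                              subset='training',
                                              class_mode='categorical',
                                              batch_size=32,
                                              target_size=(224, 224))

validation_generator = datagen.flow_from_directory(main_dir,
                                                   subset='validation',
                                                   class_mode='categorical',
                                                   batch_size=32,
                                                   target_size=(224, 224))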

According to the Keras getting-started FAQ, you can use the shuffle argument in model.fit.

In the model.fit() arguments, validation_data will override validation_split, so there is no need to configure both of them at the same time.

validation_split: Float between 0 and 1.
            Fraction of the training data to be used as validation data.
            The model will set apart this fraction of the training data,
            will not train on it, and will evaluate
            the loss and any model metrics
            on this data at the end of each epoch.

validation_data: Data on which to evaluate
            the loss and any model metrics at the end of each epoch.
            The model will not be trained on this data.
            `validation_data` will override `validation_split`
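In other words (a minimal illustration, not from the original answer; X_val and y_val stand for a held-out set you prepared yourself):

# Either let Keras carve the trailing 10% off the training data ...
model.fit(X, y_train, validation_split=0.1)

# ... or supply an explicitly held-out set. If both arguments are
# given, validation_data wins and validation_split is ignored.
model.fit(X, y_train, validation_data=(X_val, y_val))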

But there is one option that can fulfill your purpose: the shuffle argument.

shuffle: Boolean (whether to shuffle the training data
            before each epoch) or str (for 'batch').
            'batch' is a special option for dealing with the
            limitations of HDF5 data; it shuffles in batch-sized chunks.

So what you could do is:

model.fit(**other_kwargs, validation_split=0.1, shuffle=True)

A comment is not long enough, so I post it here.

If you have 1000 training samples, 100 test samples, validation_split=0.1 and batch_size=100, what it would do is: split the training data (batch 1: 90 training and 10 validation, batch 2: 90 training and 10 validation, ..., all in the original order, 90, 10, 90, 10, ..., 90, 10), and it has nothing to do with the 100 test samples (they would never be seen by your model). So I guess you only want to shuffle the size-10 validation sets without touching the size-90 training sets. What I might do is manually shuffle the 10% part of my data, because that's all shuffle=True does: it just shuffles the indices and replaces the old training data with data in the shuffled index order, like this:

import numpy as np

train_index = np.arange(1000, dtype=np.int32)
split = 0.1
batch_size = 100
num_batch = int(len(train_index) / batch_size)

# View the indices as one row per batch: the first 90 columns are the
# training part, the last 10 columns the validation part
train_index = np.reshape(train_index, (num_batch, batch_size))
for i in range(num_batch):
    # Shuffle only the 10 validation positions within batch i
    r = np.random.choice(range(10), 10, replace=False)
    print(r)
    train_index[i, int((1-split)*batch_size):] = np.array(r + ((1-split)*batch_size) + i*batch_size)
    print(train_index[i])

flatten_index = train_index.reshape(-1)
print(flatten_index)

# Reorder the actual training data with the shuffled indices
x_train = np.arange(1000, 2000)
x_train = x_train[flatten_index]
print(x_train)
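A simpler variant of the same idea (my addition, not part of any original answer): shuffle the entire training set once before calling fit, so that the trailing fraction taken by validation_split is effectively a random (though not stratified) sample. This assumes X and y_train are numpy arrays that support fancy indexing:

import numpy as np

perm = np.random.permutation(len(X))  # one random order over all samples
X_shuffled, y_shuffled = X[perm], y_train[perm]

# validation_split still takes the last 10%, but after shuffling
# that slice is a random sample of the data
model.fit(X_shuffled, y_shuffled, batch_size=myBatchSize,
          epochs=myAmountOfEpochs, validation_split=0.1,
          callbacks=myCallbackList)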
