Keras: How to take random samples for validation set?

I am currently training a Keras model, and the corresponding fit call looks like this:

model.fit(X, y_train, batch_size=myBatchSize, epochs=myAmountOfEpochs, validation_split=0.1, callbacks=myCallbackList)

This comment on the Keras GitHub page explains the meaning of `validation_split=0.1`:

The validation data is not necessarily taken from every class; it is just the last 10% of the data (assuming that you ask for 10%).
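
In other words, with array inputs, `validation_split=0.1` behaves roughly like the slicing below. This is a minimal numpy sketch with hypothetical data, based on the behavior described above:

import numpy as np

X = np.arange(100).reshape(50, 2)   # hypothetical inputs: 50 samples
y = np.repeat([0, 1], 25)           # hypothetical labels: class 0 first, class 1 last

# validation_split=0.1 slices off the *last* 10% in the original order,
# with no shuffling and no stratification
split_at = int(len(X) * (1 - 0.1))
X_train, X_val = X[:split_at], X[split_at:]
y_train, y_val = y[:split_at], y[split_at:]

print(np.unique(y_val))  # [1] -- only class 1 ends up in the validation set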

My question now is: is there an easy way to randomly select 10% of my training data as the validation data? The reason I would like to use randomly picked samples is that, in my case, the last 10% of the data does not necessarily contain all classes.

Thank you very much.

Keras does not provide anything more advanced than just taking a fraction of your training data for validation. If you need something more advanced, like stratified sampling to make sure that the classes are well represented in the sample, then you need to do this manually outside of Keras (using, say, scikit-learn or numpy) and then pass that validation data to Keras through the `validation_data` parameter of `model.fit`.

Thanks to Matias Valdenegro's comment, I was inspired to look a little further and came up with the following solution to my problem:

from sklearn.model_selection import train_test_split

# [input: X and Y]
XTraining, XValidation, YTraining, YValidation = train_test_split(X, Y, stratify=Y, test_size=0.1)  # before model building

# [The model is built here...]
model.fit(XTraining, YTraining, batch_size=batchSize, epochs=amountOfEpochs, validation_data=(XValidation, YValidation), callbacks=callbackList)
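
To sanity-check that the stratified split keeps the class proportions, here is a small self-contained example (all data below is made up for illustration):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)        # hypothetical features
Y = np.repeat([0, 1, 2, 3], 25)   # hypothetical labels, 4 balanced classes

XT, XV, YT, YV = train_test_split(X, Y, stratify=Y, test_size=0.1)

# every class keeps roughly the same share in both splits
print(np.unique(YT, return_counts=True))   # about 22-23 samples per class
print(np.unique(YV, return_counts=True))   # about 2-3 samples per class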

In this post I propose a solution that uses the split-folders package to randomly split your main data directory into training and validation directories while maintaining the class sub-folders. You can then use the Keras `.flow_from_directory` method to specify your training and validation paths.

From the split-folders documentation (note that recent versions of the package are imported as `splitfolders` rather than `split_folders`):

import split_folders

# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`, i.e., `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values

# Split val/test with a fixed number of items e.g. 100 for each set.
# To only split into training and validation set, use a single number to `fixed`, i.e., `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values

The input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

in order to give you this:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/            # optional
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...

Build the training and validation datasets using the Keras `ImageDataGenerator`:

import tensorflow as tf
import split_folders
import os

main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'

split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./224)  # note: image pixel values are more commonly rescaled by 1./255

train_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'train'),
                                                    class_mode='categorical',
                                                    batch_size=32,
                                                    target_size=(224,224),
                                                    shuffle=True)

validation_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'val'),
                                                        target_size=(224, 224),
                                                        batch_size=32,
                                                        class_mode='categorical',
                                                        shuffle=True) # set as validation data

IMG_SHAPE = (224, 224, 3)  # the generators' target_size plus 3 color channels

base_model = tf.keras.applications.ResNet50V2(
    input_shape=IMG_SHAPE,
    include_top=False,
    weights=None)

maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
prediction_layer = tf.keras.layers.Dense(4, activation='softmax')

model = tf.keras.Sequential([
    base_model,
    maxpool_layer,
    prediction_layer
])

opt = tf.keras.optimizers.Adam(learning_rate=0.004)  # 'lr' is a deprecated alias of 'learning_rate'
model.compile(optimizer=opt,
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

model.fit(
    train_generator,
    steps_per_epoch = train_generator.samples // 32,
    validation_data = validation_generator,
    validation_steps = validation_generator.samples // 32,
    epochs = 20)
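
If the data is instead split with a three-element ratio such as `ratio=(.8, .1, .1)`, the output directory also contains a test/ folder. Here is a minimal sketch, reusing `tf`, `os`, `output_dir` and `model` from the block above, for evaluating on that held-out split:

test_datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./224)

test_generator = test_datagen.flow_from_directory(os.path.join(output_dir, 'test'),
                                                  target_size=(224, 224),
                                                  batch_size=32,
                                                  class_mode='categorical',
                                                  shuffle=False)  # keep file order stable for evaluation

loss, acc = model.evaluate(test_generator, steps=test_generator.samples // 32)
print('test loss: {:.4f}, test accuracy: {:.4f}'.format(loss, acc))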

According to the Keras Getting Started FAQ, you can use the `shuffle` argument in `model.fit`.

Among the `model.fit()` arguments, `validation_data` overrides `validation_split`, so there is no need to configure both at the same time.

validation_split: Float between 0 and 1.
            Fraction of the training data to be used as validation data.
            The model will set apart this fraction of the training data,
            will not train on it, and will evaluate
            the loss and any model metrics
            on this data at the end of each epoch.

validation_data: Data on which to evaluate
            the loss and any model metrics at the end of each epoch.
            The model will not be trained on this data.
            `validation_data` will override `validation_split`

But there is one option that serves your purpose: the `shuffle` argument.

shuffle: Boolean (whether to shuffle the training data
            before each epoch) or str (for 'batch').
            'batch' is a special option for dealing with the
            limitations of HDF5 data; it shuffles in batch-sized chunks.

So what you can do is:

model.fit(**other_kwargs, validation_split = 0.1, shuffle=True)
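
One caveat: the Keras documentation for `validation_split` says the validation samples are selected from the last entries before shuffling, so `shuffle=True` on its own may not randomize which samples end up in the validation set. A minimal sketch that shuffles the data manually before calling fit (assuming numpy arrays `X` and `y` and an already compiled `model`):

import numpy as np

# shuffle the whole dataset once, so the trailing 10% that
# validation_split slices off is a random sample rather than the tail
indices = np.random.permutation(len(X))
X, y = X[indices], y[indices]

model.fit(X, y, validation_split=0.1, shuffle=True)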

A comment is not long enough, so I post it here instead.

If you have 1000 training samples, 100 test samples, validation_split=0.1 and batch_size=100, what it does is split off part of the training data (batch 1: 90 training and 10 validation, batch 2: 90 training and 10 validation, ..., all in the original order, 90, 10, 90, 10, ..., 90, 10), and this has nothing to do with the 100 test samples (your model never sees them). So my guess is that you only want to shuffle all the size-10 validation sets without touching the size-90 training sets. What I would probably do is manually shuffle the 10% part of my data, because that is all shuffle=True does: it just shuffles the indices and replaces the old training data with the newly shuffled indices, like this:

import numpy as np

# 1000 training samples in batches of 100; within each batch the last 10%
# (10 samples) is treated as the validation slice
train_index = np.arange(1000, dtype=np.int32)
split = 0.1
batch_size = 100
num_batch = int(len(train_index) / batch_size)
train_index = np.reshape(train_index, (num_batch, batch_size))

for i in range(num_batch):
    # draw a random permutation of the 10 validation positions in this batch
    r = np.random.choice(range(10), 10, replace=False)
    print(r)
    # overwrite the last 10 indices of the batch with the shuffled ones
    train_index[i, int((1-split)*batch_size):] = np.array(r + ((1-split)*batch_size) + i*batch_size)
    print(train_index[i])

flatten_index = train_index.reshape(-1)
print(flatten_index)

# reorder the actual training data with the shuffled index array
x_train = np.arange(1000, 2000)
x_train = x_train[flatten_index]
print(x_train)
