Keras: How to take random samples for validation set?

I'm currently training a Keras model whose corresponding fit call looks as follows:

model.fit(X, y_train, batch_size=myBatchSize, epochs=myAmountOfEpochs, validation_split=0.1, callbacks=myCallbackList)

This comment on the Keras GitHub page explains the meaning of validation_split=0.1:

The validation data is not necessarily taken from every class and it is just the last 10% (assuming that you ask for 10%) of the data.

My question is now: is there an easy way to randomly select, say, 10% of my training data as validation data? The reason I would like to use randomly picked samples is that the last 10% of the data don't necessarily contain all classes in my case.

Thank you very much.

Keras doesn't provide any feature more advanced than simply taking a fraction of your training data for validation. If you need something more advanced, like stratified sampling to make sure classes are well represented in the sample, then you need to do this manually outside of Keras (using, say, scikit-learn or numpy) and then pass that validation data to Keras through the validation_data parameter in model.fit.

Thanks to the comments of Matias Valdenegro, I was inspired to look a bit further and came up with the following solution to my problem:

from sklearn.model_selection import train_test_split

# X and Y hold the full training inputs and labels
XTraining, XValidation, YTraining, YValidation = train_test_split(X, Y, stratify=Y, test_size=0.1)  # before model building

# [The model is built here...]
model.fit(XTraining, YTraining, batch_size=batchSize, epochs=amountOfEpochs, validation_data=(XValidation, YValidation), callbacks=callbackList)
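As a quick sanity check (my addition, not part of the original answer), you can verify that the stratified split preserved the class proportions, assuming Y holds integer class labels (one-hot labels would need an argmax first):

import numpy as np

# Class counts per split; the ratios should match because of stratify=Y
print(np.unique(YTraining, return_counts=True))
print(np.unique(YValidation, return_counts=True))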

In this post I have suggested a solution which uses the split-folders package to randomly split your main data directory into training and validation directories while maintaining the class sub-folders. You can then use the Keras .flow_from_directory method to specify your train and validation paths.

Splitting your folders, from the docs:

import split_folders

# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`, i.e., `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values

# Split val/test with a fixed number of items e.g. 100 for each set.
# To only split into training and validation set, use a single number to `fixed`, i.e., `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values
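Note (added for context, not part of the original answer): the package is installed with pip install split-folders. In more recent releases the import name has changed to splitfolders (i.e. import splitfolders; splitfolders.ratio(...)), so adjust the snippets accordingly if the split_folders import fails.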

The input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

In order to give you this:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/            # optional
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...

Using the Keras ImageDataGenerator to build your training and validation datasets:

import tensorflow as tf
import split_folders
import os

main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'

split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255)  # scale 8-bit pixel values to [0, 1]

train_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'train'),
                                                    class_mode='categorical',
                                                    batch_size=32,
                                                    target_size=(224,224),
                                                    shuffle=True)

validation_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'val'),
                                                        target_size=(224, 224),
                                                        batch_size=32,
                                                        class_mode='categorical',
                                                        shuffle=True) # set as validation data

IMG_SHAPE = (224, 224, 3)  # matches the generators' target_size plus RGB channels

base_model = tf.keras.applications.ResNet50V2(
    input_shape=IMG_SHAPE,
    include_top=False,
    weights=None)

maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
prediction_layer = tf.keras.layers.Dense(4, activation='softmax')

model = tf.keras.Sequential([
    base_model,
    maxpool_layer,
    prediction_layer
])

opt = tf.keras.optimizers.Adam(learning_rate=0.004)
model.compile(optimizer=opt,
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

model.fit(
    train_generator,
    steps_per_epoch = train_generator.samples // 32,
    validation_data = validation_generator,
    validation_steps = validation_generator.samples // 32,
    epochs = 20)
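As an aside (my addition, not part of the original answer): if you would rather not duplicate files on disk, ImageDataGenerator itself accepts a validation_split argument, combined with subset= in flow_from_directory. Note that this split is deterministic (it slices the file list within each class folder) rather than random, which is exactly what the question is trying to avoid, so the folder-splitting approach above remains the way to get a random split:

datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    validation_split=0.3)  # reserve 30% of the files for validation

train_generator = datagen.flow_from_directory(main_dir,
                                              subset='training',
                                              class_mode='categorical',
                                              batch_size=32,
                                              target_size=(224, 224))

validation_generator = datagen.flow_from_directory(main_dir,
                                                   subset='validation',
                                                   class_mode='categorical',
                                                   batch_size=32,
                                                   target_size=(224, 224))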

According to the Keras getting-started FAQ, you can use the shuffle argument in model.fit.

In the model.fit() arguments, validation_data will override validation_split, so there is no need to configure both of them at the same time.

validation_split: Float between 0 and 1.
            Fraction of the training data to be used as validation data.
            The model will set apart this fraction of the training data,
            will not train on it, and will evaluate
            the loss and any model metrics
            on this data at the end of each epoch.

validation_data: Data on which to evaluate
            the loss and any model metrics at the end of each epoch.
            The model will not be trained on this data.
            `validation_data` will override `validation_split`
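In other words (a minimal illustration, not from the original answer; X_val and y_val stand for a held-out set you prepared yourself):

# Either let Keras carve the trailing 10% off the training data ...
model.fit(X, y_train, validation_split=0.1)

# ... or supply an explicitly held-out set. If both arguments are
# given, validation_data wins and validation_split is ignored.
model.fit(X, y_train, validation_data=(X_val, y_val))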

But there is one option that can fulfill your purpose: the shuffle argument.

shuffle: Boolean (whether to shuffle the training data
            before each epoch) or str (for 'batch').
            'batch' is a special option for dealing with the
            limitations of HDF5 data; it shuffles in batch-sized chunks.

So what you could do is:

model.fit(**other_kwargs, validation_split=0.1, shuffle=True)

A comment is not long enough, so I post it here.

If you have 1000 training samples, 100 test samples, validation_split=0.1 and batch_size=100, what it would do is: split the training data (batch 1: 90 training and 10 validation, batch 2: 90 training and 10 validation, ..., all in the original order, 90, 10, 90, 10, ..., 90, 10), and it has nothing to do with the 100 test samples (they would never be seen by your model). So I guess you only want to shuffle the size-10 validation sets without touching the size-90 training sets. What I might do is manually shuffle the 10% part of my data, because that's all shuffle=True does: it just shuffles the indices and replaces the old training data with data in the shuffled index order, like this:

import numpy as np

train_index = np.arange(1000, dtype=np.int32)
split = 0.1
batch_size = 100
num_batch = int(len(train_index) / batch_size)

# View the indices as one row per batch: the first 90 columns are the
# training part, the last 10 columns the validation part
train_index = np.reshape(train_index, (num_batch, batch_size))
for i in range(num_batch):
    # Shuffle only the 10 validation positions within batch i
    r = np.random.choice(range(10), 10, replace=False)
    print(r)
    train_index[i, int((1-split)*batch_size):] = np.array(r + ((1-split)*batch_size) + i*batch_size)
    print(train_index[i])

flatten_index = train_index.reshape(-1)
print(flatten_index)

# Reorder the actual training data with the shuffled indices
x_train = np.arange(1000, 2000)
x_train = x_train[flatten_index]
print(x_train)
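A simpler variant of the same idea (my addition, not part of any original answer): shuffle the entire training set once before calling fit, so that the trailing fraction taken by validation_split is effectively a random (though not stratified) sample. This assumes X and y_train are numpy arrays that support fancy indexing:

import numpy as np

perm = np.random.permutation(len(X))  # one random order over all samples
X_shuffled, y_shuffled = X[perm], y_train[perm]

# validation_split still takes the last 10%, but after shuffling
# that slice is a random sample of the data
model.fit(X_shuffled, y_shuffled, batch_size=myBatchSize,
          epochs=myAmountOfEpochs, validation_split=0.1,
          callbacks=myCallbackList)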
