
Keras: How to take random samples for validation set?

I am currently training a Keras model, and the corresponding fit call looks like this:

model.fit(X, y_train, batch_size=myBatchSize, epochs=myAmountOfEpochs, validation_split=0.1, callbacks=myCallbackList)

This comment on the Keras GitHub page explains the meaning of validation_split=0.1:

The validation data is not necessarily taken from every class; it is just the last 10% of the data (assuming you ask for 10%).

My question now is: is there an easy way to randomly select 10% of my training data as validation data? The reason I want to use randomly picked samples is that in my case the last 10% of the data does not necessarily contain all classes.

Thank you very much.

Keras does not provide anything more advanced than simply taking a fraction of your training data for validation. If you need something more advanced, such as stratified sampling to make sure classes are well represented in the sample, then you need to do this manually outside of Keras (e.g. using scikit-learn or numpy) and then pass that validation data to Keras through the validation_data parameter of model.fit.
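For illustration, here is a minimal sketch of such a stratified split using scikit-learn's StratifiedShuffleSplit; the arrays below are dummy stand-ins for your real X and y:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Dummy stand-ins for your data: 1000 samples, 20 features, 3 classes
X = np.random.rand(1000, 20)
y = np.random.randint(0, 3, size=1000)

# One stratified shuffle: 10% validation, class proportions preserved
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, val_idx = next(sss.split(X, y))

X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]
# Pass (X_val, y_val) as validation_data to model.fit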

Thanks to Matias Valdenegro's comment, I was inspired to look a little further and came up with the following solution to my problem:

from sklearn.model_selection import train_test_split

# [input: X and Y]
XTraining, XValidation, YTraining, YValidation = train_test_split(X, Y, stratify=Y, test_size=0.1)  # before model building

# [The model is built here...]
model.fit(XTraining, YTraining, batch_size=batchSize, epochs=amountOfEpochs, validation_data=(XValidation, YValidation), callbacks=callbackList)
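Here stratify=Y tells train_test_split to preserve the class proportions of Y in both splits, so every class is represented in the 10% validation set.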

In this post I propose a solution that uses the split-folders package to randomly split your main data directory into training and validation directories while keeping the class subfolders. You can then use Keras' .flow_from_directory method to point at your training and validation paths.

From the split-folders documentation:

import split_folders

# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`, i.e., `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values

# Split val/test with a fixed number of items e.g. 100 for each set.
# To only split into training and validation set, use a single number to `fixed`, i.e., `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values

The input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

In order to give you this:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/            # optional
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...
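As the ratio comment above notes, passing a two-element tuple splits into training and validation sets only; a minimal sketch:

import split_folders

# Two-way split: 80% train, 20% val, no test/ directory
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .2))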

Using Keras' ImageDataGenerator to build the training and validation datasets:

import tensorflow as tf
import split_folders
import os

main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'

split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255)  # normalize pixel values to [0, 1]

train_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'train'),
                                                    class_mode='categorical',
                                                    batch_size=32,
                                                    target_size=(224,224),
                                                    shuffle=True)

validation_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'val'),
                                                        target_size=(224, 224),
                                                        batch_size=32,
                                                        class_mode='categorical',
                                                        shuffle=True) # set as validation data

IMG_SHAPE = (224, 224, 3)  # matches the generators' 224x224 RGB images

base_model = tf.keras.applications.ResNet50V2(
    input_shape=IMG_SHAPE,
    include_top=False,
    weights=None)

maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
prediction_layer = tf.keras.layers.Dense(4, activation='softmax')

model = tf.keras.Sequential([
    base_model,
    maxpool_layer,
    prediction_layer
])

opt = tf.keras.optimizers.Adam(learning_rate=0.004)
model.compile(optimizer=opt,
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

model.fit(
    train_generator,
    steps_per_epoch = train_generator.samples // 32,
    validation_data = validation_generator,
    validation_steps = validation_generator.samples // 32,
    epochs = 20)
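Note that the Dense(4, activation='softmax') head assumes the dataset has four class subfolders; flow_from_directory infers the class labels from the directory names.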

According to the Keras getting-started FAQ, you can use the shuffle argument in model.fit.

Among the model.fit() arguments, validation_data overrides validation_split, so there is no need to configure both at the same time.

validation_split: Float between 0 and 1.
            Fraction of the training data to be used as validation data.
            The model will set apart this fraction of the training data,
            will not train on it, and will evaluate
            the loss and any model metrics
            on this data at the end of each epoch.

validation_data: Data on which to evaluate
            the loss and any model metrics at the end of each epoch.
            The model will not be trained on this data.
            `validation_data` will override `validation_split`

But there is one option that can achieve your purpose: the shuffle argument.

shuffle: Boolean (whether to shuffle the training data
            before each epoch) or str (for 'batch').
            'batch' is a special option for dealing with the
            limitations of HDF5 data; it shuffles in batch-sized chunks.

So what you can do is:

model.fit(**other_kwargs, validation_split = 0.1, shuffle=True)

The comment box was not long enough, so I am posting this here.

If you have 1000 training samples, 100 test samples, validation_split=0.1 and batch_size=100, what it does is split the training data (batch 1: 90 training and 10 validation, batch 2: 90 training and 10 validation, ..., all in the original order, 90,10,90,10...90,10), and this has nothing to do with the 100 test samples (your model will never see them). So I guess you only want to shuffle all the size-10 validation sets without touching the size-90 training sets. What I would probably do is manually shuffle the 10% portion of the data, because that is what shuffle=True does: it just shuffles the indices and replaces the old training data with the newly shuffled indices, like this:

import numpy as np

train_index = np.arange(1000, dtype=np.int32)
split = 0.1
batch_size = 100
num_batch = int(len(train_index) / batch_size)

# View the indices as one row per batch of 100
train_index = np.reshape(train_index, (num_batch, batch_size))
for i in range(num_batch):
    # Shuffle the last 10 indices (the validation slice) of each batch
    r = np.random.choice(range(10), 10, replace=False)
    print(r)
    train_index[i, int((1 - split) * batch_size):] = np.array(r + ((1 - split) * batch_size) + i * batch_size)
    print(train_index[i])

flatten_index = train_index.reshape(-1)
print(flatten_index)

# Apply the shuffled indices to (dummy) training data
x_train = np.arange(1000, 2000)
x_train = x_train[flatten_index]
print(x_train)
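In the printed output, the first 90 indices of each batch stay in their original order while the last 10 are permuted among themselves; indexing x_train with flatten_index then applies the same per-batch shuffle to the data itself.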

