Why do the metrics computed by model.evaluate() differ from the metrics tracked during training in Keras?

Question

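I am using Keras 2.0.4 (TensorFlow backend) for an image classification task (based on pretrained models). During training/fine-tuning I track all metrics in use (e.g. categorical_accuracy, categorical crossentropy) with CSVLogger, including the corresponding metrics for the validation set (i.e. val_categorical_accuracy, val_categorical_crossentropy).

With the ModelCheckpoint callback I keep track of the best weight configuration (save_best_only=True). To evaluate the model on the validation set I use model.evaluate().

My expectation is that the metrics tracked by CSVLogger (for the "best" epoch) equal the metrics computed by model.evaluate(). Unfortunately, this is not the case. The metrics differ by +/- 5%. Is there a reason for this behavior?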
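To make the comparison concrete, the "best" epoch can be pulled out of the CSVLogger file and placed next to the output of model.evaluate(). The following is a minimal sketch for Python 3, not part of the original code. It assumes the log is comma-separated with a header row that contains an `epoch` column and the monitored column (`val_loss`, matching the ModelCheckpoint setting in the code below). `best_epoch_row` is a hypothetical helper name, not a Keras function, and the example log values are made up.

```python
import csv
import io


def best_epoch_row(csv_text, monitor="val_loss"):
    """Return the logged row (values as floats) with the lowest `monitor` value.

    On ties the earliest epoch wins, because min() keeps the first minimum.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("log contains no epochs")
    best = min(rows, key=lambda row: float(row[monitor]))
    return {key: float(value) for key, value in best.items()}


# Made-up log content, only to show the assumed layout of the file.
example_log = """epoch,categorical_accuracy,loss,val_categorical_accuracy,val_loss
0,0.20,1.61,0.20,1.60
1,0.35,1.50,0.30,1.52
2,0.45,1.40,0.25,1.55
"""

best = best_epoch_row(example_log)
print("best epoch:", int(best["epoch"]), "val_loss:", best["val_loss"])
```

For a real run, the text would come from reading the file written by CSVLogger, and the returned values would be compared with the list returned by model.evaluate() in the order given by model.metrics_names.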

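Edit:

After some testing, I gained some insights:

If I do not use generators for the training and validation data (and therefore do not use model.fit_generator()), the problem does not occur. -> Using ImageDataGenerator for the training and validation data is the source of the discrepancy. (Note that I do not use a generator to compute evaluate, but I do use the same validation data, at least if ImageDataGenerator works as expected...)
I suspect that ImageDataGenerator does not work as it should (please also have a look at this).
If I do not use any generators at all, the problem does not occur, i.e. the metrics tracked by CSVLogger (for the "best" epoch) equal the metrics computed by model.evaluate().
Interestingly, there is another problem: if you use the same data for training and validation, the training metrics (e.g. loss) and the validation metrics (e.g. val_loss) differ at the end of each epoch.
(similar question)

Code used: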

############################ import section ############################
from __future__ import print_function # perform like in python 3.x
from keras.datasets import mnist
from keras.utils import np_utils # numpy utils for to_categorical()
from keras.models import Model, load_model
from keras.layers import Dense, GlobalAveragePooling2D, Dropout, GaussianDropout, Conv2D, MaxPooling2D
from keras.optimizers import SGD, Adam
from keras import backend as K
from keras.preprocessing.image import ImageDataGenerator 
from keras import metrics
import os
import sys
from scipy import misc
import numpy as np
from keras.applications.vgg16 import preprocess_input as vgg16_preprocess_input
from keras.applications import VGG16
from keras.callbacks import CSVLogger, ModelCheckpoint


############################ manual settings ###########################
# general settings
seed = 1337

loss_function = 'categorical_crossentropy'

learning_rate = 0.001

epochs = 10

batch_size = 20

nb_classes = 5 

img_width, img_height = 400, 400 # >= 48 necessary, as VGG16 is used

chosen_optimizer = SGD(lr=learning_rate, momentum=0.0, decay=0.0, nesterov=False)

steps_per_epoch = 40 // batch_size  # 40 train samples in 5 classes
validation_steps = 40 // batch_size # 40 validation samples in 5 classes

data_dir = # TODO: set path where data is stored (folders: 'train', 'val', 'test'; within each folder are folders named by classes)

# callbacks: CSVLogger & ModelCheckpoint
filepath = # TODO: set path, where you want to store files generated by the callbacks
file_best_checkpoint= 'best_epoch.hdf5'
file_csvlogger = 'logged_metrics.txt'

modelcheckpoint_best_epoch= ModelCheckpoint(filepath=os.path.join(filepath, file_best_checkpoint), 
                                  monitor = 'val_loss' , verbose = 1, 
                                  save_best_only = True, 
                                  save_weights_only=False, mode='auto', 
                                  period=1) # every epoch executed
csvlogger = CSVLogger(os.path.join(filepath, file_csvlogger) , separator=',', append=False)



############################ prepare data ##############################
# get validation data (for evaluation)
X_val, Y_val = # TODO: load validation data (4darray, samples, img_width, img_height, nb_channels) IMPORTANT: 5 classes with 8 images each.

# preprocess data
# NOTE: `mf` is not imported in this snippet; it is not defined anywhere in the code shown
my_preprocessing_function = mf.my_vgg16_preprocess_input

# 'augmentation' configuration we will use for training
train_datagen = ImageDataGenerator(preprocessing_function = my_preprocessing_function) # only preprocessing; static data set

# 'augmentation' configuration we will use for validation
val_datagen = ImageDataGenerator(preprocessing_function = my_preprocessing_function) # only preprocessing; static data set

train_data_dir = os.path.join(data_dir, 'train')
validation_data_dir = os.path.join(data_dir, 'val')
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle = True,
    seed = seed, # random seed for shuffling and transformations
    class_mode='categorical')  # label type (categorical = one-hot vector)

validation_generator = val_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle = True,
    seed = seed, # random seed for shuffling and transformations
    class_mode='categorical')  # label type (categorical = one-hot vector)



############################## training ###############################
print("\n---------------------------------------------------------------")
print("------------------------ training model -----------------------")
print("---------------------------------------------------------------")
# create the base pre-trained model
base_model = VGG16(include_top=False, weights = None, input_shape=(img_width, img_height, 3), pooling = 'max', classes = nb_classes)
model_name =  "VGG_modified"

# do not freeze any layers --> all layers trainable
for layer in base_model.layers:
    layer.trainable = True

# define topping of base_model
x = base_model.output # get the last layer of our base_model
x = Dense(1024, activation='relu', name='fc1')(x)
x = Dense(1024, activation='relu', name='fc2')(x)
predictions = Dense(nb_classes, activation='softmax', name='predictions')(x)

# finally, stack model together
model = Model(outputs=predictions, name= model_name, inputs=base_model.input) #Keras 1.x.x: model = Model(input=base_model.input, output=predictions) 
model.summary() # prints the summary itself; wrapping it in print() would also print "None"

# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer = chosen_optimizer, loss=loss_function, 
            metrics=['categorical_accuracy','kullback_leibler_divergence'])

# train the model on your data
model.fit_generator(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=validation_steps,
    callbacks = [csvlogger, modelcheckpoint_best_epoch])



############################## evaluation ##############################
print("\n\n---------------------------------------------------------------")
print("------------------ Evaluation of Best Epoch -------------------")
print("---------------------------------------------------------------")
# load model (corresponding to best training epoch)
model = load_model(os.path.join(filepath, file_best_checkpoint))

# evaluate model on validation data (in test mode!)
list_of_metrics = model.evaluate(X_val, Y_val, batch_size=batch_size, verbose=1, sample_weight=None)
print('\nMetrics:')
# pair each metric name with its value instead of tracking an index by hand
for name, value in zip(model.metrics_names, list_of_metrics):
    print(name + ':', value)

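Edit 2
Regarding Edit 1: if I use the same generator for the validation data during training and evaluation (by using evaluate_generator()), the problem still occurs. Hence, it is definitely a problem caused by the generators.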
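When a generator is used for evaluation, it is worth ruling out that the pass covers the validation set unevenly. The sketch below assumes the goal is to see every validation sample exactly once: the number of steps is computed with ceiling division instead of `//`, and the generator would be created with shuffle=False. `steps_to_cover` is a hypothetical helper, not a Keras function.

```python
def steps_to_cover(n_samples, batch_size):
    """Number of batches needed to see every sample exactly once (ceiling division)."""
    if n_samples <= 0 or batch_size <= 0:
        raise ValueError("n_samples and batch_size must be positive")
    return (n_samples + batch_size - 1) // batch_size


# Setup from the question: 40 validation images, batch size 20.
print(steps_to_cover(40, 20), 40 // 20)

# A sample count that is not a multiple of the batch size:
# floor division would leave the last partial batch unvisited.
print(steps_to_cover(45, 20), 45 // 20)
```

With the 40 validation images and batch size 20 from the question, both expressions give the same step count, so the step count alone does not explain the discrepancy there. The check matters once the sample count is not a multiple of the batch size. The result would be passed as the steps argument of evaluate_generator().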

Answer 1

This holds only for metrics evaluated on the validation dataset.

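The metrics computed on the training dataset during training do not reflect the true metrics of the model, because the model is updated (modified) after every batch and has therefore changed by the end of the epoch.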
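This effect can be reproduced without Keras. The toy sketch below is plain Python with a one-parameter model y = w * x trained by per-batch gradient descent on squared error; all names and numbers are made up for illustration. It computes two quantities on the same data: the mean of the batch losses recorded while the weight is still changing, which is roughly how a training loss is reported over an epoch, and the loss with the final weight after the epoch, which corresponds to a validation pass on that data.

```python
def mse(w, batch):
    """Mean squared error of the model y = w * x on a batch of (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)


def mse_grad(w, batch):
    """Derivative of mse() with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # generated with w = 3.0
batches = [data[0:2], data[2:4]]

w = 0.0
learning_rate = 0.01
batch_losses = []
for batch in batches:
    batch_losses.append(mse(w, batch))       # loss with the weight *before* the update
    w -= learning_rate * mse_grad(w, batch)  # weight changes after every batch

running_train_loss = sum(batch_losses) / len(batch_losses)
end_of_epoch_loss = mse(w, data)             # same data, final weight

print("mean of batch losses during the epoch:", running_train_loss)
print("loss on the same data after the epoch:", end_of_epoch_loss)
```

The two numbers differ even though the data is identical, because each batch loss was measured with a different, earlier weight.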

Does this help?

0 2017-05-12 12:51:52
