
Saved model trained with MirroredStrategy has poor performance when loaded

I trained a UNet binary segmentation model using tf.distribute.MirroredStrategy with a multi-GPU setup (2x NVIDIA 4090). The model seems to work fine during training, since the dice loss improves from an initial ~0.8 to 0.1. I am using the ModelCheckpoint callback to save the best model from the training session. When I try to load the model from the .h5 file, the prediction output is very bad (only a few random pixels are segmented). I even tried data from the validation set that had previously been predicted successfully during training. This behavior did not occur before I moved to the multi-GPU/MirroredStrategy setup. I tried both saving the whole model and saving only the model weights. Does anyone have any idea what could cause this issue?

This is my training function:

import math
import os
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  model = UNet(
               (image_width, image_height, 3), 
               batchnorm=True, 
               start_ch=start_channel_count, 
               depth=layer_count,     
               residual=use_residual)
  model.compile(optimizer=Adam(learning_rate=learning_rate), 
                loss=dice_coef_loss, metrics= 
                  [tf.keras.metrics.BinaryAccuracy(), 
                  tf.keras.metrics.MeanIoU(num_classes=2)])

def scheduler(epoch, lr):
  if epoch < 10:
      return lr
  else:
      return lr * tf.math.exp(-0.1)
mc = ModelCheckpoint(os.path.join(run_dir_path, "model.h5"), 
                     monitor='val_loss', verbose=1, 
                     save_best_only=True, 
                     save_weights_only=True)

history = model.fit(train_data_generator,
                    validation_data=validation_data_generator,
                    callbacks=[mc], 
                    validation_steps=
                      math.ceil(validation_count/batch_size),
                    steps_per_epoch=
                      math.ceil(train_count / batch_size),
                    epochs=100)
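As an aside, the scheduler defined above is never passed to model.fit, so it has no effect on training. A minimal sketch of the same decay rule (with math.exp swapped in for tf.math.exp so it runs without TensorFlow), plus how it would hypothetically be registered via tf.keras.callbacks.LearningRateScheduler:

```python
import math

def scheduler(epoch, lr):
    """Hold the learning rate for the first 10 epochs, then
    decay it by a factor of exp(-0.1) each epoch."""
    if epoch < 10:
        return lr
    return lr * math.exp(-0.1)

# Hypothetical wiring (not present in the original fit call):
# lr_cb = tf.keras.callbacks.LearningRateScheduler(scheduler, verbose=1)
# model.fit(..., callbacks=[mc, lr_cb], ...)
```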

Evaluating the trained model directly after training works:

model.evaluate(validation_data, 
               steps=math.ceil(validation_count / batch_size))

When I perform the evaluation after loading the weights from the .h5 file, it performs badly:

model.load_weights(os.path.join(run_dir_path, "model.h5"))
model.evaluate(validation_data, 
               steps=math.ceil(validation_count / batch_size))

I don't think the issue is caused by training with MirroredStrategy. You can check this issue on the official Keras GitHub: "model.evaluate() gives a different loss on training data from the one in training process".

Also, try using model.predict to evaluate the model and write your own custom metric calculation to see whether the results are consistent.
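A minimal sketch of such a custom check, computing the dice coefficient in NumPy on thresholded predictions (the threshold and smoothing constant are assumptions, as are the variable names in the commented usage):

```python
import numpy as np

def dice_coef(y_true, y_pred, threshold=0.5, smooth=1e-7):
    """Dice coefficient computed outside Keras: binarize the
    predicted probabilities, then compare overlap with the mask."""
    y_pred_bin = (np.asarray(y_pred) >= threshold).astype(np.float32)
    y_true = np.asarray(y_true).astype(np.float32)
    intersection = np.sum(y_true * y_pred_bin)
    return (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred_bin) + smooth)

# Hypothetical usage after model.load_weights(...):
# preds = model.predict(validation_images)
# print(dice_coef(validation_masks, preds))
```

If this number agrees with the in-training dice but model.evaluate after loading does not, the problem is in the evaluation path rather than the saved weights.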
