
Tensorflow resume training with MirroredStrategy()

I trained my model on a Linux machine so that I could use MirroredStrategy() and train on 2 GPUs. Training stopped at epoch 610. I want to resume training, but when I load my model and evaluate it, the kernel dies. I am using a Jupyter Notebook. If I reduce my training data set, the code runs, but only on 1 GPU. Is my distribution strategy saved in the model I am loading, or do I have to include it again?
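For context, a tf.distribute.MirroredStrategy is a runtime object and is not stored inside the saved .h5 file, so it has to be created again when training is resumed and the model loaded inside its scope. A minimal sketch of that pattern (the print line is just an illustrative sanity check, not from the original code):

import tensorflow as tf

# The MirroredStrategy is not saved inside the .h5 file; recreate it in the new session.
mirrored_strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', mirrored_strategy.num_replicas_in_sync)  # expect 2 on the 2-GPU machine

with mirrored_strategy.scope():
    # Load or build the model here so its variables are created under the strategy.
    ...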

UPDATE

I have tried to include MirroredStrategy():

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():

    new_model = load_model('\\models\\model_0610.h5', 
                custom_objects = {'dice_coef_loss': dice_coef_loss, 
                'dice_coef': dice_coef}, compile = True)
    new_model.evaluate(train_x,  train_y, batch_size = 2,verbose=1)

NEW ERROR

Error when I include MirroredStrategy():

ValueError: 'handle' is not available outside the replica context or a 'tf.distribute.Stragety.update()' call.

Source code:

smooth = 1
def dice_coef(y_true, y_pred):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_coef_loss(y_true, y_pred):
    return (1. - dice_coef(y_true, y_pred))

new_model = load_model('\\models\\model_0610.h5', 
                       custom_objects = {'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef}, compile = True)
new_model.evaluate(train_x,  train_y, batch_size = 2,verbose=1)

observe_var = 'dice_coef'
strategy = 'max' # greater dice_coef is better
model_resume_dir = '//models_resume//'

model_checkpoint = ModelCheckpoint(model_resume_dir + 'resume_{epoch:04}.h5', 
                                   monitor=observe_var, mode='auto', save_weights_only=False, 
                                   save_best_only=False, period = 2)

new_model.fit(train_x, train_y, batch_size = 2, epochs = 5000, verbose=1, shuffle = True, 
              validation_split = .15, callbacks = [model_checkpoint])

new_model.save(model_resume_dir + 'final_resume.h5')

new_model.evaluate() and compile = True when loading the model were causing the problem. I set compile = False and added a compile line from my original script instead.

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():

    # Load without compiling, then compile inside the strategy scope
    new_model = load_model('\\models\\model_0610.h5',
                custom_objects = {'dice_coef_loss': dice_coef_loss,
                'dice_coef': dice_coef}, compile = False)
    new_model.compile(optimizer = Adam(learning_rate = 1e-4), loss = dice_coef_loss,
                metrics = [dice_coef])
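
With the model recompiled inside the strategy scope, the fit call can be issued again to continue training. A short sketch of resuming from where it stopped; the initial_epoch = 610 value and the callback list are assumptions based on the original script above, not part of the answer:

# initial_epoch tells Keras the epoch counter has already reached 610,
# so training continues from there up to the epochs limit.
new_model.fit(train_x, train_y, batch_size = 2, epochs = 5000, verbose = 1, shuffle = True,
              validation_split = .15, callbacks = [model_checkpoint], initial_epoch = 610)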
