
Tensorflow resume training with MirroredStrategy()

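I trained my model on a Linux operating system so that I could use MirroredStrategy() and train on 2 GPUs. Training stopped at epoch 610. I want to resume training, but when I load my model and evaluate it, the kernel dies. I am using a Jupyter notebook. If I reduce my training data set the code runs, but only on 1 GPU. Is my distribution strategy saved in the model I am loading, or do I have to include it again?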

Update

I tried to include MirroredStrategy():

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():

    new_model = load_model('\\models\\model_0610.h5', 
                custom_objects = {'dice_coef_loss': dice_coef_loss, 
                'dice_coef': dice_coef}, compile = True)
    new_model.evaluate(train_x,  train_y, batch_size = 2,verbose=1)

New error

Error when including MirroredStrategy():

ValueError: 'handle' is not available outside the replica context or a 'tf.distribute.Stragety.update()' call.

Source code:

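# Imports are not shown in the original post; these are the ones the snippet needs.
from tensorflow.keras import backend as K
from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import ModelCheckpoint

smooth = 1  # smoothing term, keeps the ratio defined when both masks are empty
def dice_coef(y_true, y_pred):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_coef_loss(y_true, y_pred):
    return (1. - dice_coef(y_true, y_pred))

# Load the epoch-610 checkpoint, restoring the compiled loss/metric and optimizer.
new_model = load_model('\\models\\model_0610.h5', 
                       custom_objects = {'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef}, compile = True)
new_model.evaluate(train_x,  train_y, batch_size = 2,verbose=1)

observe_var = 'dice_coef'
strategy = 'max' # greater dice_coef is better (not passed to the callback below)
model_resume_dir = '//models_resume//'

# Save the full model every 2 epochs (`period` is deprecated in newer releases).
model_checkpoint = ModelCheckpoint(model_resume_dir + 'resume_{epoch:04}.h5', 
                                   monitor=observe_var, mode='auto', save_weights_only=False, 
                                   save_best_only=False, period = 2)

new_model.fit(train_x, train_y, batch_size = 2, epochs = 5000, verbose=1, shuffle = True, 
              validation_split = .15, callbacks = [model_checkpoint])

new_model.save(model_resume_dir + 'final_resume.h5')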
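As a quick sanity check on the custom loss, the Dice formula from the source code can be reproduced in NumPy and worked through by hand. This is only an illustrative sketch: the names `dice_coef_np` and `dice_coef_loss_np` and the toy masks are hypothetical and not part of the original post, and the arithmetic is meant to mirror `dice_coef` above rather than replace it.

```python
import numpy as np

SMOOTH = 1.0  # same value as `smooth` in the question's code


def dice_coef_np(y_true, y_pred, smooth=SMOOTH):
    """NumPy mirror of dice_coef: (2*sum(t*p) + s) / (sum(t) + sum(p) + s)."""
    y_true_f = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred_f = np.asarray(y_pred, dtype=np.float64).ravel()
    intersection = np.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (y_true_f.sum() + y_pred_f.sum() + smooth)


def dice_coef_loss_np(y_true, y_pred):
    """Loss is one minus the Dice coefficient, as in dice_coef_loss."""
    return 1.0 - dice_coef_np(y_true, y_pred)


# Toy 2x2 masks (hypothetical data): two true pixels, one of them predicted.
mask = np.array([[1, 1], [0, 0]])
pred = np.array([[1, 0], [0, 0]])

print(dice_coef_np(mask, pred))       # (2*1 + 1) / (2 + 1 + 1) = 0.75
print(dice_coef_loss_np(mask, mask))  # perfect overlap: 1 - 5/5 = 0.0
```

Because of the smoothing term, two empty masks give a coefficient of 1 rather than a division by zero.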

new_model.evaluate() and compile = True when loading the model were causing the problem. I set compile = False and added a compile line from my original script.

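import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.optimizers import Adam

# dice_coef and dice_coef_loss are defined as in the question.
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():

    # Load without compiling, then compile inside the strategy scope.
    new_model = load_model('\\models\\model_0610.h5', 
                custom_objects = {'dice_coef_loss': dice_coef_loss, 
                'dice_coef': dice_coef}, compile = False)
    new_model.compile(optimizer = Adam(learning_rate = 1e-4), loss = dice_coef_loss,
                metrics = [dice_coef])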
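One way to continue from here is sketched below, under stated assumptions. The helper names `epoch_from_checkpoint` and `resume_training` are hypothetical, and using `initial_epoch` in `fit` so that epoch numbering continues from 610 is a suggestion that does not appear in the original post. Recompiling with a new Adam instance also starts from a fresh optimizer state, which may or may not matter for a given run.

```python
import re


def epoch_from_checkpoint(path):
    """Return the trailing integer of a checkpoint name, e.g. 'model_0610.h5' -> 610."""
    match = re.search(r"(\d+)\.h5$", path)
    if match is None:
        raise ValueError(f"no epoch number found in {path!r}")
    return int(match.group(1))


def resume_training(path, custom_objects, loss, metrics, x, y,
                    total_epochs, callbacks=None):
    """Hypothetical helper: load, recompile in scope, and continue fitting."""
    import tensorflow as tf  # imported lazily so the helper above works without TF

    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        # compile=False skips restoring the saved optimizer and loss;
        # the model is then compiled inside the strategy scope instead.
        model = tf.keras.models.load_model(
            path, custom_objects=custom_objects, compile=False)
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                      loss=loss, metrics=metrics)

    # initial_epoch makes logs and checkpoint names continue from the saved epoch.
    model.fit(x, y, batch_size=2, epochs=total_epochs,
              initial_epoch=epoch_from_checkpoint(path),
              validation_split=0.15, callbacks=callbacks)
    return model


print(epoch_from_checkpoint("\\models\\model_0610.h5"))  # 610
```

With the names from the question, a call might look like `resume_training('\\models\\model_0610.h5', {'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef}, dice_coef_loss, [dice_coef], train_x, train_y, total_epochs=5000, callbacks=[model_checkpoint])`.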

