[英]How to resume from a checkpoint when using Horovod with tf.keras?
[英]Resume Training tf.keras Tensorboard
当我继续训练我的模型并在 tensorboard 上可视化进度时,我遇到了一些问题。
我的问题是如何在不手动指定任何时期的情况下从同一步骤恢复训练? 如果可能,只需加载保存的模型,它就可以从保存的优化器中读取global_step
并从那里继续训练。
我在下面提供了一些代码来重现类似的错误。
import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.models import load_model
mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(512, activation=tf.nn.relu),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, callbacks=[Tensorboard()])
model.save('./final_model.h5', include_optimizer=True)
del model
model = load_model('./final_model.h5')
model.fit(x_train, y_train, epochs=10, callbacks=[Tensorboard()])
您可以使用以下命令运行张量tensorboard
:
tensorboard --logdir ./logs
您可以将函数model.fit()
的参数initial_epoch
设置为您希望训练开始的时期数。 考虑到模型火车,直到指数的划时代epochs
达到(而不是迭代次数由下式给出epochs
)。 在您的示例中,如果您想再训练 10 个时期,则应该是:
model.fit(x_train, y_train, initial_epoch=9, epochs=19, callbacks=[Tensorboard()])
它将允许您以正确的方式在 Tensorboard 上可视化您的绘图。 更多关于这些参数的信息可以在文档中找到。
这是示例代码,以防有人需要它。 它实现了 Abhinav Anand 提出的想法:
mca = ModelCheckpoint(join(dir, 'model_{epoch:03d}.h5'),
monitor = 'loss',
save_best_only = False)
tb = TensorBoard(log_dir = join(dir, 'logs'),
write_graph = True,
write_images = True)
files = sorted(glob(join(fold_dir, 'model_???.h5')))
if files:
model_file = files[-1]
initial_epoch = int(model_file[-6:-3])
print('Resuming using saved model %s.' % model_file)
model = load_model(model_file)
else:
model = nn.model()
initial_epoch = 0
model.fit(x_train,
y_train,
epochs = 100,
initial_epoch = initial_epoch,
callbacks = [mca, tb])
将nn.model()
替换为您自己的用于定义模型的函数。
这很简单。 在训练模型时创建检查点,然后使用这些检查点从您离开的地方恢复训练。
import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model
mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(512, activation=tf.nn.relu),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, callbacks=[Tensorboard()])
model.save('./final_model.h5', include_optimizer=True)
model = load_model('./final_model.h5')
callbacks = list()
tensorboard = Tensorboard()
callbacks.append(tensorboard)
file_path = "model-{epoch:02d}-{loss:.4f}.hdf5"
# now here you can create checkpoints and save according to your need
# here period is the no of epochs after which to save the model every time during training
# another option is save_weights_only, for your case it should be false
checkpoints = ModelCheckpoint(file_path, monitor='loss', verbose=1, period=1, save_weights_only=False)
callbacks.append(checkpoints)
model.fit(x_train, y_train, epochs=10, callbacks=callbacks)
在此之后,只需从您想再次恢复训练的位置加载检查点
model = load_model(checkpoint_of_choice)
model.fit(x_train, y_train, epochs=10, callbacks=callbacks)
你已经完成了。
如果您对此有更多疑问,请告诉我。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.