Is it possible to resume training from a checkpoint model in Tensorflow?
I am doing auto segmentation, and I was training a model over the weekend when the power went out. I had trained my model for 50+ hours and saved it every 5 epochs using the line:
model_checkpoint = ModelCheckpoint('test_{epoch:04}.h5', monitor=observe_var, mode='auto', save_weights_only=False, save_best_only=False, period = 5)
I'm loading the saved model using the line:
model = load_model('test_{epoch:04}.h5', custom_objects = {'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef})
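To actually resume training from the reloaded checkpoint, you can pass `initial_epoch` to `fit()` so the epoch counter (and the `{epoch:04}` in the checkpoint filename) continues from where training stopped. Below is a minimal, self-contained sketch; the tiny dense model, random data, and simplified dice functions are illustrative stand-ins for the original 3D segmentation setup:

```python
import os
import tempfile

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Toy stand-ins for the post's custom dice metrics (illustrative only).
def dice_coef(y_true, y_pred, smooth=1.0):
    y_true_f = tf.reshape(y_true, [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)

def dice_coef_loss(y_true, y_pred):
    return 1.0 - dice_coef(y_true, y_pred)

# Tiny model and dummy data so the sketch runs quickly.
train_x = np.random.rand(16, 8).astype("float32")
train_y = np.random.randint(0, 2, (16, 1)).astype("float32")

model = keras.Sequential([
    keras.layers.Input(shape=(8,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss=dice_coef_loss, metrics=[dice_coef])

# Train for 5 epochs and save a checkpoint, like ModelCheckpoint would.
model.fit(train_x, train_y, epochs=5, verbose=0)
ckpt_path = os.path.join(tempfile.mkdtemp(), "test_0005.h5")
model.save(ckpt_path)

# --- Resume: reload the checkpoint and continue from epoch 5 ---
model = keras.models.load_model(
    ckpt_path,
    custom_objects={"dice_coef_loss": dice_coef_loss, "dice_coef": dice_coef},
)
# initial_epoch tells fit() which epoch the loaded weights correspond to;
# epochs is the total target, not "this many more".
history = model.fit(train_x, train_y, epochs=10, initial_epoch=5, verbose=0)
print(history.epoch)  # trains epochs 5..9, i.e. 5 more epochs
```

With `save_best_only=False` and `period=5`, the latest usable checkpoint is the highest-numbered `test_{epoch:04}.h5` file on disk, and `initial_epoch` should be set to that epoch number.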
I have included all of my data, which splits my training data into train_x for the scan and train_y for the label. When I run the line:
loss, dice_coef = model.evaluate(train_x, train_y, verbose=1)
I get the error:
ResourceExhaustedError: OOM when allocating tensor with shape[32,8,128,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/conv3d_1/Conv3D (defined at <ipython-input-1-4a66b6c9f26b>:275) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_distributed_function_3673]
Function call stack:
distributed_function
This basically means you are running out of memory, so you need to run the evaluation in smaller batches. The default batch size is 32; try passing a smaller one:
model.evaluate(train_x, train_y, batch_size=<batch size>)
From the Keras documentation:
batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32.
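Applied to the question's setup, the fix is just a smaller `batch_size` in `evaluate()`: with 3D volumes of shape 128×128×128, the default of 32 samples per step allocates enormous intermediate tensors, while a small batch trades speed for a much lower peak GPU memory footprint. A runnable sketch with a toy model and data standing in for the loaded segmentation model:

```python
import numpy as np
from tensorflow import keras

# Toy model/data standing in for the loaded 3D segmentation model.
model = keras.Sequential([
    keras.layers.Input(shape=(8,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

train_x = np.random.rand(64, 8).astype("float32")
train_y = np.random.randint(0, 2, (64, 1)).astype("float32")

# Default batch_size=32 pushes 32 samples through the network at once;
# batch_size=4 evaluates the same data in smaller chunks, reducing the
# peak memory needed for activations like [32, 8, 128, 128, 128].
loss = model.evaluate(train_x, train_y, batch_size=4, verbose=0)
```

The result is identical up to floating-point accumulation order; only the memory/throughput trade-off changes.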