Is it possible to resume training from a checkpoint model in TensorFlow?

Question

I am doing auto segmentation. I was training a model over the weekend when the power went out. I had trained the model for 50+ hours and saved it every 5 epochs using the line:

model_checkpoint = ModelCheckpoint('test_{epoch:04}.h5', monitor=observe_var, mode='auto', save_weights_only=False, save_best_only=False, period = 5)

I'm loading the saved model using the line:

model = load_model('test_{epoch:04}.h5', custom_objects = {'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef})

I have included all of my data, with the training data split into train_x for the scans and train_y for the labels. When I run the line:

loss, dice_coef = model.evaluate(train_x,  train_y, verbose=1)

I get the error:

ResourceExhaustedError:  OOM when allocating tensor with shape[32,8,128,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
 [[node model/conv3d_1/Conv3D (defined at <ipython-input-1-4a66b6c9f26b>:275) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_3673]

Function call stack:
distributed_function

Answer 1

You are running out of GPU memory, so you need to evaluate in smaller batches. The default batch size is 32; try passing a smaller one:

model.evaluate(train_x, train_y, batch_size=<batch size>)
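To see why a smaller batch helps, here is a rough back-of-the-envelope sketch of the size of the single tensor named in the error. It assumes that "type float" means 4-byte float32 and that the leading 32 in the shape is the batch dimension (it matches the default batch size); the helper name tensor_bytes is made up for illustration.

```python
import math

def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (4 bytes per element assumes float32)."""
    return math.prod(shape) * bytes_per_element

# Shape copied from the OOM message; first dimension assumed to be the batch.
oom_shape = (32, 8, 128, 128, 128)

full = tensor_bytes(oom_shape)
per_sample = tensor_bytes((1,) + oom_shape[1:])

print(f"batch of 32: {full / 2**30:.2f} GiB")       # batch of 32: 2.00 GiB
print(f"batch of 1:  {per_sample / 2**20:.0f} MiB")  # batch of 1:  64 MiB
```

That is only one activation of one layer, so the whole forward pass needs considerably more. Under these assumptions the activation memory scales linearly with batch_size, which is why lowering it is the first thing to try.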

From the Keras documentation:

batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32.
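As for the question in the title, a minimal sketch of resuming from the last saved file might look like the following. Note that load_model needs a concrete file name such as test_0045.h5, not the 'test_{epoch:04}.h5' template. The helpers latest_checkpoint and resume_training, and the total_epochs and batch_size arguments, are hypothetical names for illustration; the Keras calls used are load_model and the initial_epoch argument of fit. It assumes the checkpoints sit in the working directory and, since they were saved with save_weights_only=False, that the optimizer state is restored along with the weights.

```python
import re
from pathlib import Path

# Matches files written by ModelCheckpoint('test_{epoch:04}.h5', ...)
CHECKPOINT_PATTERN = re.compile(r"^test_(\d+)\.h5$")

def latest_checkpoint(directory="."):
    """Return (path, epoch) for the highest-numbered checkpoint, or (None, 0)."""
    found = []
    for path in Path(directory).glob("test_*.h5"):
        match = CHECKPOINT_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    if not found:
        return None, 0
    epoch, path = max(found, key=lambda item: item[0])
    return path, epoch

def resume_training(train_x, train_y, custom_objects, callbacks,
                    total_epochs, batch_size=1):
    """Hypothetical helper: reload the newest checkpoint and keep training."""
    # Imported here so the file-name helper above works without TensorFlow.
    from tensorflow.keras.models import load_model

    path, epoch = latest_checkpoint()
    if path is None:
        raise FileNotFoundError("no test_*.h5 checkpoint found")
    model = load_model(str(path), custom_objects=custom_objects)
    # initial_epoch makes fit continue counting from the saved epoch,
    # so the checkpoint file names carry on where they left off.
    model.fit(train_x, train_y, batch_size=batch_size,
              epochs=total_epochs, initial_epoch=epoch, callbacks=callbacks)
    return model
```

Here custom_objects would be the same dictionary passed to load_model in the question, and callbacks would include the same ModelCheckpoint. The batch_size default of 1 is only a placeholder; use whatever fits in GPU memory.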

Answer 1: accepted, 2020-04-28 00:39:51
