
Is it possible to resume training from a checkpoint model in Tensorflow?

I am doing auto segmentation. I was training a model over the weekend when the power went out. I had trained the model for 50+ hours and saved it every 5 epochs using the line:

model_checkpoint = ModelCheckpoint('test_{epoch:04}.h5', monitor=observe_var, mode='auto', save_weights_only=False, save_best_only=False, period = 5)

I'm loading the saved model using the line:

model = load_model('test_{epoch:04}.h5', custom_objects = {'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef})

I have loaded all of my data and split the training set into train_x for the scans and train_y for the labels. When I run the line:

loss, dice_coef = model.evaluate(train_x,  train_y, verbose=1)

I get the error:

ResourceExhaustedError:  OOM when allocating tensor with shape[32,8,128,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
 [[node model/conv3d_1/Conv3D (defined at <ipython-input-1-4a66b6c9f26b>:275) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_3673]

Function call stack:
distributed_function

This basically means you are running out of GPU memory, so you need to evaluate in smaller batches. The default batch size is 32; try passing a smaller one.

model.evaluate(train_x, train_y, batch_size=<batch size>)
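To see why the batch size matters here, it may help to work out the size of the single tensor named in the error. The sketch below assumes the leading 32 in shape[32,8,128,128,128] is the batch dimension and that "type float" means 4-byte float32; tensor_bytes and evaluate_in_small_batches are hypothetical helper names, not part of Keras.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (4 bytes per element for float32)."""
    return prod(shape) * bytes_per_element


# Shape taken from the OOM message: [batch, 8, 128, 128, 128].
at_batch_32 = tensor_bytes((32, 8, 128, 128, 128))
at_batch_1 = tensor_bytes((1, 8, 128, 128, 128))

print(at_batch_32 / 2**30, "GiB at batch size 32")  # prints: 2.0 GiB at batch size 32
print(at_batch_1 / 2**20, "MiB at batch size 1")    # prints: 64.0 MiB at batch size 1


def evaluate_in_small_batches(model, x, y, batch_size=1):
    """Call model.evaluate with an explicit, small batch size.

    `model` is assumed to be a compiled Keras model, for example the one
    returned by load_model in the question.
    """
    return model.evaluate(x, y, batch_size=batch_size, verbose=1)
```

That one activation alone needs 2 GiB at the default batch size, and the network holds many such tensors, so something like evaluate_in_small_batches(model, train_x, train_y, batch_size=1) is a reasonable starting point; you can raise the batch size until the error comes back.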

From the Keras documentation:

batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32.
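As for the question in the title: since the checkpoints were written with save_weights_only=False, each .h5 file should hold the full model, so training can in principle be picked up from the newest one. Below is a minimal sketch under a few assumptions: the files sit in one directory and follow the test_{epoch:04}.h5 pattern from the question, the number in the filename is the count of completed epochs (as far as I know, ModelCheckpoint formats the name with epoch + 1), and load_model is given a concrete filename rather than the template string. latest_checkpoint and resume_training are hypothetical helpers, and total_epochs, batch_size and callbacks are whatever the original fit call used.

```python
import re
from pathlib import Path

# Matches names produced by the 'test_{epoch:04}.h5' template in the question.
CHECKPOINT_NAME = re.compile(r"test_(\d{4,})\.h5")


def latest_checkpoint(filenames):
    """Return (filename, epoch) for the highest-numbered checkpoint, or None."""
    best = None
    for name in filenames:
        match = CHECKPOINT_NAME.fullmatch(str(name))
        if match is None:
            continue  # not a checkpoint file
        epoch = int(match.group(1))
        if best is None or epoch > best[1]:
            best = (str(name), epoch)
    return best


def resume_training(train_x, train_y, custom_objects, total_epochs,
                    batch_size, callbacks, checkpoint_dir="."):
    """Load the newest checkpoint and continue fitting from its epoch."""
    # Imported here so the filename helper above works without TensorFlow.
    from tensorflow.keras.models import load_model

    names = [p.name for p in Path(checkpoint_dir).glob("test_*.h5")]
    found = latest_checkpoint(names)
    if found is None:
        raise FileNotFoundError("no test_XXXX.h5 checkpoint in " + str(checkpoint_dir))
    filename, epoch = found

    # custom_objects is needed for the custom loss and metric, as in the question.
    model = load_model(str(Path(checkpoint_dir) / filename),
                       custom_objects=custom_objects)
    # initial_epoch makes epoch numbering (and checkpoint names) continue
    # from where the interrupted run stopped.
    model.fit(train_x, train_y, batch_size=batch_size,
              initial_epoch=epoch, epochs=total_epochs, callbacks=callbacks)
    return model
```

If memory was the reason evaluate failed, the same small batch_size would likely be needed in fit as well.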

