
Can't load tensorflow keras checkpoint when using tf.distribute.MirroredStrategy()

I'm trying to load a tf.keras (v1.15.0) model from a checkpoint created with the ModelCheckpoint callback, modify it by removing several layers and adding new ones, and then continue training it on a new task. I'm using tf.distribute.MirroredStrategy() to do distributed training with two GPUs.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():

    # Load pretrained model from checkpoint
    model = get_model()
    model.load_weights('file_name.hdf5')

    # Chop off some layers, add new layers
    model = modify_pretrained_model(model)

    model.compile(optimizer=opt, loss=loss)

The model loads fine and compiles, and I can run model.summary(), but when I call model.fit() or model.predict() I get the following errors in my Python stack:

  (0) Failed precondition: Error while reading resource variable compression0_conv0_batchnorm/moving_variance from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/compression0_conv0_batchnorm/moving_variance/N10tensorflow3VarE does not exist.
     [[{{node time_distributed_1/model_1/compression0_conv0_batchnorm/FusedBatchNormV3/ReadVariableOp_1}}]]
     [[dense_1_1/Sigmoid/_225]]
  (1) Failed precondition: Error while reading resource variable compression0_conv0_batchnorm/moving_variance from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/compression0_conv0_batchnorm/moving_variance/N10tensorflow3VarE does not exist.
     [[{{node time_distributed_1/model_1/compression0_conv0_batchnorm/FusedBatchNormV3/ReadVariableOp_1}}]]
0 successful operations.
1 derived errors ignored

This issue seems to fix this exact problem, but without using tf.distribute to continue training.

When I instantiate a session outside of the distribute scope and set a reference to it inside the scope, the code crashes with the same error:

import tensorflow as tf
from tensorflow.keras.backend import set_session

tf_config = some_custom_config
sess = tf.Session(config=tf_config)
graph = tf.get_default_graph()

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():

    set_session(sess)

    # Load pretrained model from checkpoint
    model = get_model()
    model.load_weights('file_name.hdf5')

    # Chop off some layers, add new layers
    model = modify_pretrained_model(model)

    model.compile(optimizer=opt, loss=loss)

I spent a good 2-3 days trying to figure this out. The only thing that really worked was upgrading to tf 2.0.0; then everything worked like magic. Alternatively, as a last resort, I was able to train the first model, add and remove layers, recompile, and continue training within the same Python execution using the same distribution strategy, but I was never able to reload a tf.keras ModelCheckpoint with distribution strategies in tf 1.15.0.
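The TF 2.x path that worked can be sketched as follows. This is a minimal, self-contained illustration, not the question's actual code: the tiny model, the layer names, and the surgery on the head layer are stand-ins for get_model() and modify_pretrained_model() from the question.

```python
import numpy as np
import tensorflow as tf

def get_model():
    # Stand-in for the question's pretrained architecture
    inp = tf.keras.Input(shape=(8,))
    x = tf.keras.layers.Dense(16, activation='relu', name='base')(inp)
    out = tf.keras.layers.Dense(1, activation='sigmoid', name='head')(x)
    return tf.keras.Model(inp, out)

# Simulate the original training run producing a checkpoint
model = get_model()
model.save_weights('ckpt.h5')

# In TF 2.x, loading the checkpoint inside the strategy scope does not
# trigger the uninitialized-variable errors seen in 1.15
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = get_model()
    model.load_weights('ckpt.h5')

    # Chop off the old head and attach a new one for the new task
    base_output = model.get_layer('base').output
    new_head = tf.keras.layers.Dense(4, activation='softmax',
                                     name='new_head')(base_output)
    model = tf.keras.Model(model.input, new_head)

    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Continue training on the new task
x = np.random.rand(32, 8).astype('float32')
y = np.random.randint(0, 4, size=(32,))
history = model.fit(x, y, epochs=1, verbose=0)
```

On a machine without multiple GPUs, MirroredStrategy falls back to a single device, so the sketch runs anywhere TF 2.x is installed.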

