
Can't load tensorflow keras checkpoint when using tf.distribute.MirroredStrategy()

I'm trying to load a tf.keras (v1.15.0) model from a checkpoint created with the ModelCheckpoint callback, modify it by removing several layers and adding new ones, and then continue training it on a new task. I'm using tf.distribute.MirroredStrategy() to do distributed training with two GPUs.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():

    # Load pretrained model from checkpoint
    model = get_model()
    model.load_weights('file_name.hdf5')

    # Chop off some layers, add new layers
    model = modify_pretrained_model(model)

    model.compile(optimizer=opt, loss=loss)

The model loads fine and compiles, and I can run model.summary(), but when I call model.fit() or model.predict() I get the following errors in my Python stack:

  (0) Failed precondition: Error while reading resource variable compression0_conv0_batchnorm/moving_variance from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/compression0_conv0_batchnorm/moving_variance/N10tensorflow3VarE does not exist.
     [[{{node time_distributed_1/model_1/compression0_conv0_batchnorm/FusedBatchNormV3/ReadVariableOp_1}}]]
     [[dense_1_1/Sigmoid/_225]]
  (1) Failed precondition: Error while reading resource variable compression0_conv0_batchnorm/moving_variance from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/compression0_conv0_batchnorm/moving_variance/N10tensorflow3VarE does not exist.
     [[{{node time_distributed_1/model_1/compression0_conv0_batchnorm/FusedBatchNormV3/ReadVariableOp_1}}]]
0 successful operations.
1 derived errors ignored

This issue seems to fix this exact problem, but without using tf.distribute to continue training.

When I instantiate a session outside of the distribute scope and set a reference to it inside the scope, the code crashes with the same error:

import tensorflow as tf
from tensorflow.keras.backend import set_session

tf_config = some_custom_config
sess = tf.Session(config=tf_config)
graph = tf.get_default_graph()

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():

    set_session(sess)

    # Load pretrained model from checkpoint
    model = get_model()
    model.load_weights('file_name.hdf5')

    # Chop off some layers, add new layers
    model = modify_pretrained_model(model)

    model.compile(optimizer=opt, loss=loss)

I spent a good 2-3 days trying to figure this out. The only thing that really worked was upgrading to tf 2.0.0; then everything worked like magic. Alternatively, as a last resort, I was able to train the first model, add and remove layers, recompile, and continue training within the same Python execution using the same distribution strategy, but I was never able to reload a tf.keras ModelCheckpoint with distribution strategies in tf 1.15.0.
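The TF 2.x path that worked can be sketched as follows. This is a minimal, self-contained illustration, not the question's actual code: the tiny model, the layer names, and the surgery on the head layer are stand-ins for get_model() and modify_pretrained_model() from the question.

```python
import numpy as np
import tensorflow as tf

def get_model():
    # Stand-in for the question's pretrained architecture
    inp = tf.keras.Input(shape=(8,))
    x = tf.keras.layers.Dense(16, activation='relu', name='base')(inp)
    out = tf.keras.layers.Dense(1, activation='sigmoid', name='head')(x)
    return tf.keras.Model(inp, out)

# Simulate the original training run producing a checkpoint
model = get_model()
model.save_weights('ckpt.h5')

# In TF 2.x, loading the checkpoint inside the strategy scope does not
# trigger the uninitialized-variable errors seen in 1.15
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = get_model()
    model.load_weights('ckpt.h5')

    # Chop off the old head and attach a new one for the new task
    base_output = model.get_layer('base').output
    new_head = tf.keras.layers.Dense(4, activation='softmax',
                                     name='new_head')(base_output)
    model = tf.keras.Model(model.input, new_head)

    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Continue training on the new task
x = np.random.rand(32, 8).astype('float32')
y = np.random.randint(0, 4, size=(32,))
history = model.fit(x, y, epochs=1, verbose=0)
```

On a machine without multiple GPUs, MirroredStrategy falls back to a single device, so the sketch runs anywhere TF 2.x is installed.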

