How to resume from a checkpoint when using Horovod with tf.keras?

Note: I'm using TF 2.1.0 and the tf.keras API. I've experienced the below issue with all Horovod versions between 0.18 and 0.19.2.

When resuming from a tf.keras h5 checkpoint, are we supposed to call hvd.load_model() on all ranks (approach 1), or should we only load the model on rank 0 and let BroadcastGlobalVariablesCallback share those weights with the other workers (approach 2)? Is approach 1 incorrect or invalid, in that it will mess up training or produce different results than approach 2?
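For concreteness, here is a minimal sketch of what I mean by approach 2, using a toy model and a hypothetical checkpoint path ('checkpoint.h5'): every rank builds and compiles the model, only rank 0 restores the weights, and the broadcast callback pushes rank 0's variables to the other workers before the first step.

import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Every rank builds and compiles the same model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),
])
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(loss='mse', optimizer=opt)

# Only rank 0 restores the checkpointed weights (hypothetical path).
if hvd.rank() == 0:
    model.load_weights('checkpoint.h5')

# BroadcastGlobalVariablesCallback(0) sends rank 0's variables to all
# workers before training resumes.
x = np.random.rand(64, 8).astype('float32')
y = np.random.rand(64, 1).astype('float32')
model.fit(x, y, epochs=1,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)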

I'm currently training a ResNet-based model with some BatchNorm layers, and if we only try to load the model on the first rank (and build/compile the model on the other ranks), we hit a stalled tensor issue ( https://github.com/horovod/horovod/issues/1271 ). However, if we call hvd.load_model on all ranks when resuming, training resumes normally at first but seems to diverge immediately, so I'm confused as to whether loading the checkpointed model on all ranks (with hvd.load_model) can somehow cause training to diverge. At the same time, we can't load it only on rank 0 because of https://github.com/horovod/horovod/issues/1271 , which causes BatchNorm to hang in Horovod. Has anyone been able to successfully call hvd.load_model only on rank 0 when using BatchNorm tf.keras layers? Can someone please provide some tips here?
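For reference, the resume path that seems to diverge (approach 1) is simply this, again with a hypothetical checkpoint path; as I understand it, hvd.load_model restores the full model, optimizer state included, on every rank and wraps the restored optimizer in hvd.DistributedOptimizer:

import horovod.tensorflow.keras as hvd

hvd.init()

# Every rank deserializes the same h5 checkpoint itself; all ranks
# therefore start from identical weights and optimizer state.
model = hvd.load_model('checkpoint.h5')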

Thanks!

According to https://github.com/horovod/horovod/issues/120 , this is the solution:

You should also be able to specify the optimizer via a custom object:

from tensorflow import keras
import horovod.tensorflow.keras as hvd

# Rebuild Adam from its saved config and wrap it in DistributedOptimizer.
model = keras.models.load_model('file.h5', custom_objects={
    'Adam': lambda **kwargs: hvd.DistributedOptimizer(keras.optimizers.Adam(**kwargs))
})
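If I'm reading that suggestion right, it works because the h5 checkpoint stores the optimizer under its class name ('Adam'), and the custom_objects entry substitutes a factory that rebuilds Adam from the saved config and immediately wraps it in hvd.DistributedOptimizer, so the restored model comes back with a distributed optimizer on every rank without needing hvd.load_model.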
