
TensorFlow: Is it possible to restore checkpoint models for multi-gpu training?

I am currently using a Supervisor and have constructed just one graph to perform transfer learning using the pre-trained weights from TF-slim. I am wondering if there is a way to restore a checkpoint model into multiple inference models at the outset. My primary concern is that, firstly, the name scopes defined as in the reference code in the TF repository may cause the pre-trained variables to fail to restore due to a name mismatch. Also, given that I have to use a Supervisor with an init_fn that takes in only one saver to restore the variables, how could I have multiple savers restore the same variables to multiple GPUs (if I even need multiple savers at all)?

One idea I have is that perhaps I could restore the variables into just one graph and let the other GPUs use that same graph for training. However, would the training on the next GPU then take place only after the first GPU has completed? Also, this way I still would not be able to restore the weights under the original checkpoint variable names, unless I edited the names of the checkpoint weights.

Regarding saving and restoring variables, the TensorFlow documentation points you to the Saver object, which lets you specify which saved variables are restored into which model variables: when constructing the saver, you pass a dictionary mapping the names under which variables were saved to the variable objects in your current graph.
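A minimal sketch of that name-mapping, using the TF 1.x-style API (via `tensorflow.compat.v1`); the scope names `InceptionV3` and `tower_0` are hypothetical stand-ins for the checkpoint's variable names and a multi-GPU tower scope:

```python
import os
import tempfile

import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

ckpt_path = os.path.join(tempfile.mkdtemp(), "model.ckpt")

# 1) Simulate a pre-trained checkpoint: the variable is saved under
#    the name "InceptionV3/w".
g1 = tf.Graph()
with g1.as_default():
    with tf.variable_scope("InceptionV3"):
        w = tf.get_variable("w", initializer=tf.constant([[1.0, 2.0]]))
    saver = tf.train.Saver()
    with tf.Session(graph=g1) as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, ckpt_path)

# 2) The multi-GPU graph builds the same variable under a tower scope,
#    so its in-graph name is "tower_0/InceptionV3/w" -- a name mismatch
#    with the checkpoint.
g2 = tf.Graph()
with g2.as_default():
    with tf.variable_scope("tower_0"):
        with tf.variable_scope("InceptionV3"):
            w2 = tf.get_variable("w", shape=[1, 2])
    # Bridge the mismatch: map the checkpoint's saved name to the
    # in-graph variable when constructing the Saver.
    restorer = tf.train.Saver({"InceptionV3/w": w2})
    with tf.Session(graph=g2) as sess:
        restorer.restore(sess, ckpt_path)
        restored = sess.run(w2)

print(restored)  # [[1. 2.]]
```

If all towers share variables via `tf.variable_scope(..., reuse=...)`, one such saver restoring the shared variables once should be enough; you would not need a separate saver per GPU.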

