
Tensorflow Multi-GPU loss

I am studying how to implement multi-GPU training on Tensorflow, and I am now reading this source as recommended in the documentation. As far as I understand, at line 178 the variable loss accounts for the loss of only one GPU (as the comment states). Thus, at the end of the cycle, say at line 192, loss retains the value of the loss of the last GPU considered. The variable loss is not modified until its use at line 243, where it is passed to Session.run() to be computed. So the loss value printed at line 255 is only the loss of the last GPU, not the total one. It seems unlikely to me that Google engineers would get such a simple thing wrong, so what am I missing? Thanks!
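To make the observation concrete, here is a minimal sketch of the tower-loop pattern I am describing (not the tutorial's actual code; the per-tower loss is just a dummy tensor):

    import tensorflow as tf

    # Minimal sketch of the tower loop in cifar10_multi_gpu_train.py;
    # the real code builds the model and computes gradients per tower.
    num_gpus = 2
    tower_losses = []
    for i in range(num_gpus):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i):
                # stand-in for tower_loss(scope) in the tutorial
                loss = tf.reduce_mean(tf.square(tf.random_normal([32, 10])))
                tower_losses.append(loss)

    # After the loop, `loss` still refers to the tensor built on the LAST GPU,
    # so sess.run(loss) reports only that tower's loss.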

It doesn't seem that you are missing anything. They consider that printing the value of the loss and reporting the summaries for one tower is sufficient.

Generally, you track the loss/summaries for each GPU and/or compute the mean loss only for debugging, when you start using a new model on multiple GPUs. Afterwards, tracking only one tower is sufficient, since every tower contains the same copy of the model.
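For example, if you do want the mean loss across towers while debugging, one option (my own sketch, not something the tutorial does) is to average the per-tower loss tensors collected in the loop, e.g. the tower_losses list from the sketch in the question:

    # Assumes `tower_losses` holds one scalar loss tensor per GPU, collected
    # inside the tower loop (see the sketch in the question above).
    mean_loss = tf.reduce_mean(tf.stack(tower_losses))
    tf.summary.scalar('mean_tower_loss', mean_loss)  # optional TensorBoard summary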

BTW, I find it way easier to use tf.estimators to do multi-GPU training, using both tf.contrib.estimator.replicate_model_fn(...) and tf.contrib.estimator.TowerOptimizer(...) to distribute the model and optimizer.
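A rough sketch of that setup (TF 1.x contrib APIs; the toy model and hyper-parameters are placeholders of mine, not from the original post):

    import tensorflow as tf

    def model_fn(features, labels, mode):
        # Toy model: a single dense layer; replace with your own network.
        logits = tf.layers.dense(features['x'], 10)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
        optimizer = tf.train.GradientDescentOptimizer(0.01)
        # TowerOptimizer wraps the optimizer so gradients from all towers are merged.
        optimizer = tf.contrib.estimator.TowerOptimizer(optimizer)
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    # replicate_model_fn clones model_fn onto every available GPU and takes care
    # of splitting the batch and averaging the per-tower losses.
    estimator = tf.estimator.Estimator(
        model_fn=tf.contrib.estimator.replicate_model_fn(model_fn))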

