TensorFlow distributed training pauses after each epoch
I am training a neural network in parallel on 2 GPUs using the TensorFlow MirroredStrategy. With a single GPU, each epoch takes 19 seconds to complete, whereas with 2 GPUs, each epoch takes 13 seconds. This does not surprise me, since I know the scaling is not perfect due to the all_reduce overhead of updating the variables during training.
However, after each epoch of the distributed training, there is a pause of about 8 seconds. When using a single GPU, this pause is less than 1 second. Does anyone know why there is such a long pause after each epoch when training distributed?
Alternatively, can anyone explain what happens differently in distributed training at the end of an epoch?
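For context, the setup described above can be sketched roughly as follows. This is not the asker's actual code; the model, layer sizes, and the commented-out `fit` call are placeholders:

```python
import tensorflow as tf

# MirroredStrategy replicates the model on all visible GPUs and
# all-reduces gradients across replicas after each step.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables created inside the scope are mirrored across the GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset, epochs=10)  # each epoch then runs data-parallel
```

With 2 GPUs, each batch is split across the replicas, which is why per-epoch time drops but does not halve: the all_reduce step synchronizes gradients every iteration.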
Apparently this had something to do with running TF in graph mode. By setting tf.compat.v1.enable_eager_execution()
the problem went away. This also fixed a memory leak that was causing issues, so perhaps the pause was being caused by TF making copies of something that I wasn't expecting.