
Tensorflow distributed training pause after each epoch

I am training a neural network in parallel on 2 GPUs using the Tensorflow MirroredStrategy. With a single GPU, each epoch takes 19 seconds to complete, whereas with 2 GPUs each epoch takes 13 seconds. This doesn't surprise me, since I know the scaling is not perfect due to the all_reduce overhead of updating the variables during training.
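For reference, a minimal sketch of the kind of setup described here (the model architecture and dummy data are placeholders, not the asker's actual code):

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model on all visible GPUs and
# all-reduces gradients across replicas after each step.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables (layers, optimizer state) must be created inside the
    # scope so they are mirrored across both GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Dummy data standing in for the real training set.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.randint(0, 10, size=(1024,))
model.fit(x, y, epochs=5, batch_size=64)
```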

However, after each epoch of the distributed training, there is a pause of about 8 seconds. With a single GPU, this pause is less than 1 second. Does anyone know why there is such a long pause after each epoch when training distributed?

Alternatively, can anyone explain what happens differently at the end of an epoch in distributed training?

Apparently this had something to do with running TF in graph mode. By calling tf.compat.v1.enable_eager_execution() the problem went away. This also fixed a memory leak that was causing issues, so perhaps the pause was caused by TF making copies of something I wasn't expecting.
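A sketch of where that call would go, assuming a TF 1.x-style setup (in TF 2.x eager execution is already the default, so the call has no effect there):

```python
import tensorflow as tf

# Must be called once at program startup, before any graphs, ops, or
# models are created; enables eager execution for TF 1.x code paths.
tf.compat.v1.enable_eager_execution()

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    ...  # build and compile the model as before
```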

