Tensorflow Multi-GPU - NCCL

I have been wanting to increase my batch size to improve the generalization of my model (it's very batch-size sensitive). The solution for this is to go multi-GPU in order to utilize more memory. I am using tensorflow.keras (with tensorflow 2.1 on Windows 10) in my script, and followed the instructions for configuring a mirrored strategy for my model. The issue is that my training script runs perfectly fine without the mirrored-strategy code, but with the mirrored strategy, I get an error regarding NCCL. This looks to be the exact same issue as:

https://github.com/tensorflow/tensorflow/issues/21470
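
For context, my setup follows the standard tf.distribute pattern; here is a minimal sketch (the model and layer sizes are placeholders, not my actual network), and the error appears as soon as the strategy is involved:

import tensorflow as tf

# Create the strategy first, then build and compile the model inside its
# scope so the variables are mirrored across all visible GPUs.
strategy = tf.distribute.MirroredStrategy()
print('Number of replicas:', strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# The batch size passed to model.fit is the global batch size, which is
# split across the replicas, so scale it by the number of GPUs.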

Unfortunately, the solution discussed in that link:

cross_tower_ops = tf.contrib.distribute.AllReduceCrossDeviceOps(
    'hierarchical_copy', num_packs=num_gpus)
strategy = tf.contrib.distribute.MirroredStrategy(cross_tower_ops=cross_tower_ops)

does not work with tf 2.1, since the 'contrib' portion of tf appears to have been removed. Does anyone know what the replacement fix is for NCCL on Windows, or what replaces the 'contrib' APIs that are gone?
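
The closest tf 2.x analogue of that contrib snippet I could find is the cross_device_ops argument on tf.distribute.MirroredStrategy; this is my reading of the 2.1 API rather than anything from the issue thread, and num_packs=1 below is just a placeholder value:

import tensorflow as tf

# Assumed tf 2.x equivalent of the contrib snippet above: 'hierarchical_copy'
# appears to map to HierarchicalCopyAllReduce, which takes num_packs directly.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(num_packs=1))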

One solution from issue 21470 is to build NCCL for Windows x64. MyCaffe provides instructions for that here: https://github.com/MyCaffe/NCCL/blob/master/INSTALL.md

You'll need VS 2015, VS 2017, and the CUDA development package, and once compiled, the produced .dlls must be put in the correct location.
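
Once the DLLs are in place, a quick sanity check (my own snippet, not part of MyCaffe's instructions) is to confirm TensorFlow actually detects the GPUs before retrying the strategy:

import tensorflow as tf

# MirroredStrategy needs at least two visible GPUs to be useful; this just
# confirms the devices are detected after the DLLs are placed.
print(tf.config.list_physical_devices('GPU'))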

In my experience, some cross_device_ops would not work and produce errors.

This option was meant for the NVIDIA DGX-1 architecture and might underperform on other architectures:

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

This should work:

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice())

This would not work with my configuration:

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce())

So it is advisable to try the different options.
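
To make trying them easy, here is a small sketch (my own scaffolding, not from the answer) that picks one of the three cross-device ops by name:

import tensorflow as tf

# The three cross-device ops discussed above; change the key to test which
# one works on your configuration.
cross_device_ops = {
    'hierarchical': tf.distribute.HierarchicalCopyAllReduce(),
    'one_device': tf.distribute.ReductionToOneDevice(),
    'nccl': tf.distribute.NcclAllReduce(),
}

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=cross_device_ops['one_device'])

with strategy.scope():
    # Build and compile your model here as usual.
    pass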
