
Pytorch Multi-GPU Issue

I want to train my model with 2 GPUs (ids 5 and 6), so I run my code with CUDA_VISIBLE_DEVICES=5,6 python train.py. However, when I print torch.cuda.current_device() I still get id 0 rather than 5 or 6. But torch.cuda.device_count() is 2, which seems right. How can I use GPUs 5 and 6 correctly?

It is most likely correct. PyTorch only sees two GPUs (therefore indexed 0 and 1), which are actually your GPUs 5 and 6.
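The remapping can be illustrated with a small pure-Python sketch (no GPUs needed). The helper function here is hypothetical, but it mirrors how the CUDA runtime maps the devices listed in CUDA_VISIBLE_DEVICES to the logical indices PyTorch reports:

```python
def logical_to_physical(visible_devices: str) -> dict:
    """Map logical device indices (0, 1, ...) as seen by PyTorch
    to the physical GPU ids listed in CUDA_VISIBLE_DEVICES."""
    physical_ids = [int(x) for x in visible_devices.split(",") if x.strip()]
    return {logical: physical for logical, physical in enumerate(physical_ids)}

# With CUDA_VISIBLE_DEVICES=5,6 the runtime exposes exactly two devices:
mapping = logical_to_physical("5,6")
print(mapping)       # {0: 5, 1: 6}
print(len(mapping))  # 2 -> matches torch.cuda.device_count()
print(min(mapping))  # 0 -> matches torch.cuda.current_device()
```

So `cuda:0` inside the process is physical GPU 5, and `cuda:1` is physical GPU 6.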

Check the actual usage with nvidia-smi. If it is still inconsistent, you might need to set an environment variable:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
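If you prefer to set these variables from inside the script instead of the shell, note that they must be assigned before torch is imported, since the CUDA runtime reads them once at initialization. A minimal sketch:

```python
import os

# Must run before `import torch`; otherwise the CUDA runtime has already
# been initialized and these settings are ignored.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # order GPUs as nvidia-smi does
os.environ["CUDA_VISIBLE_DEVICES"] = "5,6"      # expose physical GPUs 5 and 6

# import torch  # torch now sees the two GPUs as cuda:0 and cuda:1
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 5,6
```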

(See Inconsistency of IDs between 'nvidia-smi -L' and cuDeviceGetName())

You can check the device name to verify that it is the correct GPU. However, when you set CUDA_VISIBLE_DEVICES outside the script, you have forced torch to see only those 2 GPUs, so torch indexes them as 0 and 1. Because of this, when you check current_device(), it outputs 0.
