
Pytorch Multi-GPU Issue

I want to train my model with 2 GPUs (ids 5 and 6), so I run my code with CUDA_VISIBLE_DEVICES=5,6 python train.py. However, when I print torch.cuda.current_device() I still get id 0 rather than 5 or 6. But torch.cuda.device_count() is 2, which seems right. How can I use GPUs 5 and 6 correctly?

It is most likely correct. PyTorch only sees two GPUs (therefore indexed 0 and 1), which are actually your GPUs 5 and 6.
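The remapping can be illustrated with a small pure-Python sketch (no GPUs needed). The helper function here is hypothetical, but it mirrors how the CUDA runtime maps the devices listed in CUDA_VISIBLE_DEVICES to the logical indices PyTorch reports:

```python
def logical_to_physical(visible_devices: str) -> dict:
    """Map logical device indices (0, 1, ...) as seen by PyTorch
    to the physical GPU ids listed in CUDA_VISIBLE_DEVICES."""
    physical_ids = [int(x) for x in visible_devices.split(",") if x.strip()]
    return {logical: physical for logical, physical in enumerate(physical_ids)}

# With CUDA_VISIBLE_DEVICES=5,6 the runtime exposes exactly two devices:
mapping = logical_to_physical("5,6")
print(mapping)       # {0: 5, 1: 6}
print(len(mapping))  # 2 -> matches torch.cuda.device_count()
print(min(mapping))  # 0 -> matches torch.cuda.current_device()
```

So `cuda:0` inside the process is physical GPU 5, and `cuda:1` is physical GPU 6.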

Check the actual usage with nvidia-smi. If it is still inconsistent, you might need to set an environment variable:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
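If you prefer to set these variables from inside the script instead of the shell, note that they must be assigned before torch is imported, since the CUDA runtime reads them once at initialization. A minimal sketch:

```python
import os

# Must run before `import torch`; otherwise the CUDA runtime has already
# been initialized and these settings are ignored.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # order GPUs as nvidia-smi does
os.environ["CUDA_VISIBLE_DEVICES"] = "5,6"      # expose physical GPUs 5 and 6

# import torch  # torch now sees the two GPUs as cuda:0 and cuda:1
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 5,6
```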

(See Inconsistency of IDs between 'nvidia-smi -L' and cuDeviceGetName())

You can check the device name to verify that it is the correct GPU. However, when you set CUDA_VISIBLE_DEVICES outside the script, you have forced torch to see only those 2 GPUs, so torch indexes them as 0 and 1. Because of this, when you check current_device(), it outputs 0.
