
How to use multiple GPUs in pytorch?

I use this command to use a GPU:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

But I want to use two GPUs in Jupyter, like this:

device = torch.device("cuda:0,1" if torch.cuda.is_available() else "cpu")

Using multiple GPUs is as simple as wrapping the model in DataParallel and increasing the batch size. Check the official PyTorch tutorials for a quick start.

Assuming you want to distribute the data across the available GPUs (if you have a batch size of 16 and 2 GPUs, you are looking at providing 8 samples to each GPU), rather than spreading parts of the model across different GPUs, this can be done as follows:

If you want to use all the available GPUs:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = CreateModel()

model = nn.DataParallel(model)
model.to(device)

If you want to use specific GPUs (for example, 2 out of 4 GPUs):

device = torch.device("cuda:1,3" if torch.cuda.is_available() else "cpu") ## specify the GPU id's, GPU id's start from 0.

model = CreateModel()

model = nn.DataParallel(model, device_ids=[1, 3])
model.to(device)

To use specific GPUs by setting an OS environment variable:

Before executing the program, set the CUDA_VISIBLE_DEVICES variable as follows:

export CUDA_VISIBLE_DEVICES=1,3 (assuming you want to select the 2nd and 4th GPUs)

Then, within the program, you can just use DataParallel() as though you wanted to use all the GPUs (similar to the first case). Here the GPUs available to the program are restricted by the OS environment variable.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = CreateModel()

model = nn.DataParallel(model)
model.to(device)

In all of these cases, the data has to be mapped to the device.

If X and y are the data:

X = X.to(device)  # tensor.to() is not in-place, so reassign the result
y = y.to(device)
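For illustration, here is a minimal self-contained sketch of that mapping inside a training loop. The toy dataset, linear model, and MSE loss are placeholders, not part of the answer above:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy data and model, just to make the sketch runnable.
dataset = TensorDataset(torch.randn(16, 10), torch.randn(16, 1))
loader = DataLoader(dataset, batch_size=8)

model = nn.DataParallel(nn.Linear(10, 1)).to(device)
criterion = nn.MSELoss()

for X, y in loader:
    X = X.to(device)  # reassign: tensor.to() returns a new tensor
    y = y.to(device)
    loss = criterion(model(X), y)  # DataParallel splits X across the GPUs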

Another option would be to use some helper libraries for PyTorch:

PyTorch Ignite library: Distributed GPU training

There is a concept of a context manager for distributed configuration on the following backends (a minimal sketch follows the list):

  • nccl - torch native distributed configuration on multiple GPUs
  • xla-tpu - TPUs distributed configuration
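This is only a sketch of how Ignite's idist.Parallel context manager can be used, assuming a recent Ignite release; the two-process and nccl settings are example values:

import ignite.distributed as idist

def training(local_rank, config):
    # Each spawned process runs this function; idist.device() resolves
    # to the right device for the current process (e.g. cuda:<local_rank>).
    device = idist.device()
    print(f"rank {local_rank} using {device}", config)

# Launch 2 processes (one per GPU) with the nccl backend.
with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
    parallel.run(training, config={})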

PyTorch Lightning Multi-GPU training

This is possibly the best option IMHO to train on CPU/GPU/TPU without changing your original PyTorch code.
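As a minimal sketch, assuming a recent Lightning release where the Trainer takes accelerator/devices (older releases used Trainer(gpus=2) instead); the LitModel, toy data, and hyperparameters are illustrative:

import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        X, y = batch  # Lightning has already moved the batch to the right GPU
        return nn.functional.mse_loss(self.layer(X), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)
trainer = pl.Trainer(accelerator="gpu", devices=2, max_epochs=1)  # 2 GPUs
trainer.fit(LitModel(), loader)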

Catalyst is worth checking for similar distributed GPU options.

In 2022, PyTorch says:

It is recommended to use DistributedDataParallel, instead of this class, to do multi-GPU training, even if there is only a single node. See: Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel and Distributed Data Parallel.

in https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html#torch.nn.DataParallel

Thus, it seems that we should use DistributedDataParallel, not DataParallel.
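For reference, a minimal single-node DistributedDataParallel sketch, meant to be launched with torchrun --nproc_per_node=2 ddp_demo.py (the file name, toy model, and batch are illustrative assumptions, not from the quoted docs):

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")     # torchrun sets RANK/WORLD_SIZE etc.
    local_rank = int(os.environ["LOCAL_RANK"])  # one process per GPU
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(10, 1).to(local_rank), device_ids=[local_rank])

    X = torch.randn(8, 10, device=local_rank)   # toy per-process batch
    y = torch.randn(8, 1, device=local_rank)
    nn.functional.mse_loss(model(X), y).backward()  # gradients are all-reduced across GPUs

    dist.destroy_process_group()

if __name__ == "__main__":
    main()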

When I ran naiveinception_googlenet, the above methods didn't work for me. The following method solved my problem:

import os

# Set these before importing torch so they take effect when CUDA initializes.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # make device ids follow the PCI bus (nvidia-smi) order
os.environ["CUDA_VISIBLE_DEVICES"] = "0,3"  # specify which GPU(s) to be used

If you want to run your code only on specific GPUs (e.g. only on GPUs 2 and 3), you can specify that using the CUDA_VISIBLE_DEVICES=2,3 variable when launching the python code from the terminal:

CUDA_VISIBLE_DEVICES=2,3 python lstm_demo_example.py --epochs=30 --lr=0.001

and inside the code, leave it as:

device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')
model = LSTMModel()
model = nn.DataParallel(model)
model = model.to(device)

Source: https://glassboxmedicine.com/2020/03/04/multi-gpu-training-in-pytorch-data-and-model-parallelism/
