Modify existing PyTorch code to run on multiple GPUs

Question

I am trying to run the PyTorch UNet from the following link on 2 or more GPUs.

The changes I have made so far are:

1. From:

net = UNet(n_channels=3, n_classes=1, bilinear=True)
logging.info(f'Network:\n'
             f'\t{net.n_channels} input channels\n'
             f'\t{net.n_classes} output channels (classes)\n'
             f'\t{"Bilinear" if net.bilinear else "Transposed conv"} upscaling')

To:

net = UNet(n_channels=3, n_classes=1, bilinear=True)
net = nn.DataParallel(net)
logging.info(f'Network:\n'
             f'\t{net.module.n_channels} input channels\n'
             f'\t{net.module.n_classes} output channels (classes)\n'
             f'\t{"Bilinear" if net.module.bilinear else "Transposed conv"} upscaling')

And everywhere the code had:

net.<something>

I replaced it with:

net.module.<something>

I know PyTorch sees more than one GPU, because torch.cuda.device_count() returns 2.

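But as soon as I try to run a training job that needs more memory than the first GPU has, I get:

RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 11.91 GiB total capacity; 10.51 GiB already allocated; 82.56 MiB free; 818.92 MiB cached)

I vary the memory the training needs by changing the batch size. Any help is welcome.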

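Edit

I see that training runs twice as fast with 2 GPUs, but the maximum batch size I can run on a single GPU is the same as with two GPUs. Is there any way to use the memory of both GPUs together in a single training run?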
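For context, nn.DataParallel splits each input batch along dimension 0 and sends one slice to each GPU, so the memory needed per GPU depends on the slice rather than the full batch. The sketch below only approximates that split with torch.chunk; it is not the library's internal code path, and the batch size and image shape are hypothetical values chosen for illustration.

```python
import torch

n_gpus = 2       # matches the torch.cuda.device_count() result in the question
batch_size = 8   # hypothetical batch size
batch = torch.randn(batch_size, 3, 64, 64)  # hypothetical image shape

# Approximate the scatter step: split along the batch dimension,
# one slice per GPU.
slices = torch.chunk(batch, n_gpus, dim=0)
per_gpu = [s.shape[0] for s in slices]
print(per_gpu)
```

If the forward pass goes through the DataParallel wrapper, each GPU should therefore only need to hold roughly half of the batch's activations when two GPUs are used.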

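Answer 1

My mistake was changing output = net(input) (net is commonly named model) to:

output = net.module(input)

Calling net.module(input) bypasses the DataParallel wrapper, so the whole batch runs on a single GPU. The forward pass has to stay as output = net(input); only attribute access such as n_channels needs net.module.

You can find more information here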
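A minimal sketch of the corrected pattern follows. TinyNet is a hypothetical stand-in for UNet that only mirrors the n_channels and n_classes attribute names from the question; the tensor shapes are also made up. It falls back to the CPU when no GPU is available, in which case DataParallel simply calls the wrapped module.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for UNet, used only for illustration."""

    def __init__(self, n_channels=3, n_classes=1):
        super().__init__()
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.conv = nn.Conv2d(n_channels, n_classes, kernel_size=1)

    def forward(self, x):
        return self.conv(x)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = nn.DataParallel(TinyNet()).to(device)

# Custom attributes live on the wrapped model, so they need .module
n_in = net.module.n_channels

batch = torch.randn(4, n_in, 8, 8, device=device)

# The forward pass goes through the wrapper, not net.module,
# so the batch can be scattered across the visible GPUs.
output = net(batch)
print(output.shape)
```

With this change the batch size passed to net(...) is the total across all GPUs, so it can usually be raised beyond what a single GPU could hold.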
