How to train a model with multiple GPUs in PyTorch?

Question

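My server has two GPUs. How can I use both of them at the same time for training, so that I make full use of their compute? Is my code below correct? Will it allow my model to be trained properly?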

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # pretrained_model is defined elsewhere (a BERT-style encoder)
        self.bert = pretrained_model
        # for param in self.bert.parameters():
        #     param.requires_grad = True
        # in_features (2048 here) must match the encoder's hidden size
        self.linear = nn.Linear(2048, 4)

    #def forward(self, input_ids, token_type_ids, attention_mask):
    def forward(self, input_ids, attention_mask):
        #output = self.bert(input_ids, token_type_ids, attention_mask).pooler_output
        # last_hidden_state has shape (batch_size, seq_len, hidden_size)
        output = self.bert(input_ids, attention_mask).last_hidden_state
        print('last_hidden_state', output.shape)
        # keep only the last token's vector: (batch_size, hidden_size)
        output = output[:, -1, :]
        output = self.linear(output)
        return output

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# build the model unconditionally so it also exists with zero or one GPU
model = MyModel()
if torch.cuda.device_count() > 1:
    print("Use", torch.cuda.device_count(), "gpus")
    # DataParallel splits each input batch across the visible GPUs
    model = nn.DataParallel(model)
model = model.to(device)

Answer 1

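There are two different ways to train on multiple GPUs:

Data parallelism = splitting a large batch that can't fit into a single GPU's memory across multiple GPUs. Every GPU holds a copy of the model and processes a smaller batch that fits into its own memory.
Model parallelism = splitting the layers of the model across different devices, which is a bit tricky to manage and deal with.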
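To make the second option more concrete, here is a minimal sketch of model parallelism. `TwoStageNet`, `pick_devices` and the layer sizes are hypothetical names and values chosen for illustration, not something from the question. The sketch falls back to a single GPU or the CPU when two GPUs are not available, so it only shows the mechanics.

```python
import torch
import torch.nn as nn


def pick_devices():
    """Return two devices: distinct GPUs when available, otherwise a fallback."""
    n = torch.cuda.device_count()
    if n >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    if n == 1:
        return torch.device("cuda:0"), torch.device("cuda:0")
    return torch.device("cpu"), torch.device("cpu")


class TwoStageNet(nn.Module):
    """Hypothetical toy network whose two halves live on different devices."""

    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.stage1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU()).to(dev0)
        self.stage2 = nn.Linear(32, 4).to(dev1)

    def forward(self, x):
        h = self.stage1(x.to(self.dev0))
        # Activations have to be moved between devices explicitly.
        return self.stage2(h.to(self.dev1))


dev0, dev1 = pick_devices()
model = TwoStageNet(dev0, dev1)
logits = model(torch.randn(8, 16))
loss = logits.sum()
loss.backward()  # autograd carries gradients back across the device boundary
print(logits.shape)
```

Note that in this naive form only one device is busy at a time, which is part of why model parallelism is harder to get right than data parallelism.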

See this post for more information.

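To do data parallelism in pure PyTorch, please refer to this example that I created, which I have brought up to date with the latest changes in PyTorch (1.12 as of today).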
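The linked example is not reproduced here. As a separate illustration, below is a sketch of data parallelism in pure PyTorch using `DistributedDataParallel`, which the PyTorch documentation recommends over `nn.DataParallel` even on a single machine. This is not necessarily what the linked example does. `run_worker`, the stand-in `nn.Linear` model and the file-based rendezvous are assumptions for illustration. In real use you would start one process per GPU (for example with `torchrun` or `torch.multiprocessing.spawn`); here a single process with `world_size=1` is used so the sketch also runs on a machine without GPUs.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_worker(rank, world_size, init_file):
    """Run one training step in one process (normally one process per GPU)."""
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(
        backend="nccl" if use_cuda else "gloo",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")

    model = nn.Linear(10, 4).to(device)  # hypothetical stand-in model
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    # Each process would normally load its own shard of the dataset.
    x = torch.randn(8, 10, device=device)
    y = torch.randint(0, 4, (8,), device=device)

    loss = nn.functional.cross_entropy(ddp_model(x), y)
    optimizer.zero_grad()
    loss.backward()  # gradients are averaged across processes here
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()


# Single-process demo; the rendezvous file must not exist beforehand.
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
final_loss = run_worker(rank=0, world_size=1, init_file=init_file)
print(final_loss)
```

With several processes, each one should also get a different slice of the data, typically through `torch.utils.data.distributed.DistributedSampler`.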

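If you would rather use another library for multi-GPU training without having to engineer much yourself, I would suggest PyTorch Lightning, as it has a straightforward API and good documentation for learning how to do multi-GPU training with data parallelism.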

Answer 2

I use data parallelism. I followed this link, which is a useful reference: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
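In the spirit of that tutorial, here is a small sketch of one training step with `nn.DataParallel`. `TinyClassifier` and the tensor sizes are hypothetical stand-ins, not the model from the question. The `print` inside `forward` shows the slice of the batch each replica receives; with a single GPU or on the CPU the model is simply not wrapped and sees the whole batch.

```python
import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for the real model."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)

    def forward(self, x):
        # Under DataParallel each replica sees only its slice of the batch.
        print("replica input:", tuple(x.shape), "on", x.device)
        return self.fc(x)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyClassifier()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.randn(32, 16).to(device)
targets = torch.randint(0, 4, (32,)).to(device)

outputs = model(inputs)  # per-replica outputs are gathered into one tensor
loss = nn.functional.cross_entropy(outputs, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("gathered output:", tuple(outputs.shape))
```

When the model is wrapped, the original module is available as `model.module`, which is the object you usually want when saving a `state_dict`.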


Solution 1
1 2022-08-07 15:36:45

Solution 2
0 2022-08-08 01:58:12
