Process stuck when training on multiple nodes using PyTorch DistributedDataParallel
I am trying to run the script mnist-distributed.py from Distributed data parallel training in Pytorch. I have also pasted the same code here. (I have replaced my actual MASTER_ADDR with a.b.c.d for posting here.)
import os
import argparse
import torch.multiprocessing as mp
import torchvision
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.distributed as dist


class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N')
    parser.add_argument('-g', '--gpus', default=1, type=int,
                        help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int,
                        help='ranking within the nodes')
    parser.add_argument('--epochs', default=2, type=int, metavar='N',
                        help='number of total epochs to run')
    args = parser.parse_args()
    args.world_size = args.gpus * args.nodes
    os.environ['MASTER_ADDR'] = 'a.b.c.d'
    os.environ['MASTER_PORT'] = '8890'
    mp.spawn(train, nprocs=args.gpus, args=(args,))


def train(gpu, args):
    rank = args.nr * args.gpus + gpu
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=args.world_size,
        rank=rank
    )
    torch.manual_seed(0)
    model = ConvNet()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    batch_size = 100
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)
    # Wrap the model
    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
    # Data loading code
    train_dataset = torchvision.datasets.MNIST(
        root='./data',
        train=True,
        transform=transforms.ToTensor(),
        download=True
    )
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=args.world_size,
        rank=rank
    )
    train_loader = torch.utils.data.DataLoader(
        dataset=train_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=0,
        pin_memory=True,
        sampler=train_sampler)
    total_step = len(train_loader)
    for epoch in range(args.epochs):
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)
            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(
                    epoch + 1,
                    args.epochs,
                    i + 1,
                    total_step,
                    loss.item())
                )


if __name__ == '__main__':
    main()
There are 2 nodes with 2 GPUs each. I run this command from the terminal of the master node:

python mnist-distributed.py -n 2 -g 2 -nr 0

and then this from the terminal of the other node:

python mnist-distributed.py -n 2 -g 2 -nr 1

But then my process gets stuck with no output on either terminal.

Running the same code on a single node using the following command works perfectly fine:

python mnist-distributed.py -n 1 -g 2 -nr 0
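Since nothing is printed before the hang, one way to see how far the rendezvous and NCCL setup get is to turn on NCCL's debug logging on both nodes before launching. This is only a diagnostic sketch, and eth0 is a placeholder for whatever network interface actually connects the two machines:

# enable verbose NCCL logging and pin the NIC used for inter-node traffic
export NCCL_DEBUG=INFO          # each rank prints its NCCL init/transport details
export NCCL_SOCKET_IFNAME=eth0  # placeholder: an interface reachable from the other node
python mnist-distributed.py -n 2 -g 2 -nr 0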
I met a similar problem, and it was solved by editing the GRUB configuration:

sudo vi /etc/default/grub

Edit it:

#GRUB_CMDLINE_LINUX=""           <----- original, commented out
GRUB_CMDLINE_LINUX="iommu=soft"  <----- change to this

sudo update-grub

Reboot to see the change.

Ref: https://github.com/pytorch/pytorch/issues/1637#issuecomment-338268158
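If it helps, the running kernel's command line shows whether the new parameter actually took effect after the reboot (assuming a standard Linux setup where GRUB passes the option through):

# after rebooting, confirm the flag is active on each node
cat /proc/cmdline
# the output should now include: iommu=soft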