Use PyTorch DistributedDataParallel with Hugging Face on Amazon SageMaker
Even for single-instance training, PyTorch DistributedDataParallel (DDP) is generally recommended over PyTorch DataParallel (DP), because DP's strategy is less performant and uses more memory on the default device. (Per this PyTorch forums thread)
Hugging Face recommends running distributed training via the python -m torch.distributed.launch launcher, because their Trainer API supports DDP but will fall back to DP if you don't. (Per this HF forums thread)
I recently ran into this problem: scaling a HF training job from p3.8xlarge to p3.16xlarge increased memory consumption on (I think) one of the GPUs to the point where I had to significantly reduce batch size to avoid CUDA Out of Memory errors, basically losing all scaling advantage.
So the good news is that for p3.16xl+ I can just enable SageMaker Distributed Data Parallel and the PyTorch DLC will automatically launch via torch.distributed for me.
The bad news, for use cases with smaller workloads or wanting to test before they scale up, is that SMDistributed doesn't support all multi-GPU instance types. No p3.8xl or g series, for example. I did try manually setting the sagemaker_distributed_dataparallel_enabled environment variable, but no joy.
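For illustration, here is a minimal sketch of how a job definition might switch SageMaker Distributed Data Parallel on only where it is supported. The dictionary is the shape the SageMaker Python SDK expects for an estimator's distribution argument. The set of supported instance types and the helper name distribution_for are assumptions made for this example, so check the current documentation before relying on them.

```python
# Assumption: instance types that supported SageMaker Distributed Data Parallel
# around the time of this thread. Verify against the current documentation.
SMDDP_INSTANCE_TYPES = {"ml.p3.16xlarge", "ml.p3dn.24xlarge", "ml.p4d.24xlarge"}


def distribution_for(instance_type):
    """Return a value for the estimator's `distribution` argument, or None.

    `distribution_for` is a hypothetical helper, not part of the SageMaker SDK.
    """
    if instance_type in SMDDP_INSTANCE_TYPES:
        return {"smdistributed": {"dataparallel": {"enabled": True}}}
    # Unsupported type (for example ml.p3.8xlarge): DDP has to be launched another way.
    return None


if __name__ == "__main__":
    for itype in ("ml.p3.16xlarge", "ml.p3.8xlarge"):
        print(itype, distribution_for(itype))
```

The returned value would be passed as distribution=... when constructing the estimator. A None result means the training script needs one of the manual launch methods discussed in the answer.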
So how else can we launch HF Trainer scripts with PyTorch DDP on SageMaker?
Great question, thanks for asking! PyTorch DDP runs data parallel workers in multiple processes that must be launched and managed by developers. DDP should be seen as a managed allreduce more than a managed data-parallelism library, since it requires you to launch and manage the workers and even assign resources to them. In order to launch the DDP processes in a SageMaker Training job you have many options:
1. If you do multi-GPU, single-machine, you can use torch.multiprocessing.spawn, as shown in this official PyTorch demo (that is broken, by the way).
2. If you do multi-GPU, any number of machines, you can use torch.distributed.launch, wrapped in a launcher script in shell or Python. Example here: https://gitlab.aws.dev/cruchant/a2d2-segmentation/-/blob/main/3_2D-Seg-Audi-A2D2-Distributed-Training-DDP.ipynb
3. You can also use the SageMaker MPI integration instead of torch.distributed to launch those processes. Unfortunately, we didn't create documentation for this, so no one uses it nor pitches it. But it looks cool, because it allows you to run copies of your script directly on the EC2 machines without the need to invoke an intermediary PyTorch launcher. Example here.
So for now, my recommendation would be to go route (3), which is the closest to what the PyTorch community does, and so provides an easier development and debugging path.
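As a rough idea of what a Python launcher script wrapping torch.distributed.launch could look like, here is a minimal sketch. The function name build_launch_cmd, the default port 29500 and the script name train.py are assumptions for illustration. Only the launcher flags themselves come from PyTorch.

```python
import subprocess
import sys


def build_launch_cmd(script, script_args, nproc_per_node, nnodes=1, node_rank=0,
                     master_addr="127.0.0.1", master_port=29500):
    """Build the argv list that starts torch.distributed.launch on one node."""
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",  # one process per GPU on this node
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script, *script_args,
    ]


def main(argv):
    # Hypothetical usage: python launcher.py 4 train.py --epochs 3
    num_gpus, script, *script_args = argv
    cmd = build_launch_cmd(script, script_args, nproc_per_node=int(num_gpus))
    # check=True makes the launcher exit with an error if training fails.
    subprocess.run(cmd, check=True)


if __name__ == "__main__" and len(sys.argv) > 2:
    main(sys.argv[1:])
```

With this approach the launcher script becomes the SageMaker entry point, and the real training script is passed to it as an argument.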
Notes:
In PT 1.10, torch.distributed.launch is replaced by torchrun, and a torchX tool is being created to... simplify things.
The number of machines and the number of processes per machine are set respectively by the arguments --nnodes and --nproc_per_node of the torch.distributed.launch utility. You must run torch.distributed.launch once on each node of the training cluster. You can run this command in parallel with multiple tools, for example with MPI or SageMaker Training as mentioned above. In order to establish the necessary handshakes and form a cluster, you must also specify in the torch.distributed.launch command --node_rank, which must take a unique machine ID between 0 and N-1 on each of the machines, and --master_addr and --master_port, which must be the same across all machines and are optional if you run a single-machine cluster.
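On SageMaker, one way these values might be derived on each node is from the environment variables that the training container exposes. The sketch below assumes SM_HOSTS (a JSON list of host names), SM_CURRENT_HOST and SM_NUM_GPUS are set, as they are in SageMaker training containers. The function name cluster_settings and the port 29500 are assumptions for illustration.

```python
import json
import os


def cluster_settings(env=os.environ, master_port=29500):
    """Derive per-node launcher settings from SageMaker-style environment variables."""
    # Sorting gives every node the same ordering, so ranks are unique and
    # every node agrees on which host is the master.
    hosts = sorted(json.loads(env["SM_HOSTS"]))
    return {
        "nnodes": len(hosts),
        "nproc_per_node": int(env["SM_NUM_GPUS"]),
        "node_rank": hosts.index(env["SM_CURRENT_HOST"]),  # unique, 0 to N-1
        "master_addr": hosts[0],                           # same on all nodes
        "master_port": master_port,                        # same on all nodes
    }


if __name__ == "__main__":
    # Simulated environment for the second node of a 3-node, 4-GPU cluster.
    fake_env = {
        "SM_HOSTS": '["algo-1", "algo-2", "algo-3"]',
        "SM_CURRENT_HOST": "algo-2",
        "SM_NUM_GPUS": "4",
    }
    print(cluster_settings(fake_env))
```

Each value maps to the launcher flag of the same name, so every node computes its own --node_rank while sharing identical --master_addr and --master_port values.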
In the init_process_group DDP initialization method running from within each data parallel replica script, you must specify the world size and replica ID, respectively with the world_size and rank parameters. Hence you must have a way to communicate to each script a unique ID, generally called the global rank. The global rank can help you personalize the work done by each GPU, for example saving a model from just one card, or running validation on only one card. In a cluster composed of 3 machines having 4 GPUs each, global ranks would range from 0 to 11.
Within a machine, in order to assign DDP data parallel replicas to available GPUs, the script running in each replica must be assigned a GPU ID, unique within the machine it's running on. This is called the local rank and can be set as an argument by the PyTorch DDP launcher torch.distributed.launch. In a cluster composed of 3 machines having 4 GPUs each, on each machine the DDP processes would have local ranks ranging from 0 to 3.