
Use PyTorch DistributedDataParallel with Hugging Face on Amazon SageMaker

Original post: 2022-09-08 09:03:22 · Tags: pytorch, amazon-sagemaker, huggingface-transformers

Even for single-instance training, PyTorch DistributedDataParallel (DDP) is generally recommended over PyTorch DataParallel (DP), because DP's strategy is less performant and it uses more memory on the default device. (Per this PyTorch forums thread)

Hugging Face recommends running distributed training via the python -m torch.distributed.launch launcher, because their Trainer API supports DDP but will fall back to DP if you don't use it. (Per this HF forums thread)

I recently ran into this problem: scaling a HF training job from p3.8xlarge to p3.16xlarge increased memory consumption on (I think) one of the GPUs to the point where I had to significantly reduce batch size to avoid CUDA Out of Memory errors - basically losing all scaling advantage.

So the good news is that for p3.16xl+ I can just enable SageMaker Distributed Data Parallel and the PyTorch DLC will automatically launch via torch.distributed for me.
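For reference, enabling SageMaker Distributed Data Parallel is done through the estimator's distribution argument in the SageMaker Python SDK. The snippet below is a minimal sketch of that setting only. The script name and instance count are placeholders, not values from my actual job.

```python
# Setting that enables SageMaker Distributed Data Parallel (SMDP) on an estimator.
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

# Placeholder estimator arguments; "train.py" and the instance count are assumptions.
estimator_kwargs = {
    "entry_point": "train.py",
    "instance_type": "ml.p3.16xlarge",  # an instance type that SMDP supports
    "instance_count": 1,
    "distribution": distribution,
}

# These would be passed to an estimator such as sagemaker.huggingface.HuggingFace,
# together with an IAM role and framework versions.
print(estimator_kwargs["distribution"])
```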

The bad news, for use cases with smaller workloads or for anyone wanting to test before scaling up, is that SMDistributed doesn't support all multi-GPU instance types. No p3.8xl or g series, for example. I did try manually setting the sagemaker_distributed_dataparallel_enabled environment variable, but no joy.

So how else can we launch HF Trainer scripts with PyTorch DDP on SageMaker?

1 Answer

Great question, thanks for asking. PyTorch DDP runs data parallel workers in multiple processes that must be launched and managed by developers. DDP should be seen as a managed allreduce more than a managed data-parallelism library, since it requires you to launch and manage the workers and even to assign resources to them. To launch the DDP processes in a SageMaker Training job you have several options:

1. If you do multi-GPU, single-machine, you can use torch.multiprocessing.spawn, as shown in this official PyTorch demo (which is broken, by the way).
2. If you do multi-GPU, single-machine, you can also use the Ray Train library to launch those processes. I was able to use it in a Notebook, but not in the DLC yet (it is a recent library that is a bit rough to learn and get working, see all my issues here). Ray Train should work on multi-node too.
3. If you do multi-GPU, any-machine, you can use torch.distributed.launch, wrapped in a launcher script in shell or Python. Example here: https://gitlab.aws.dev/cruchant/a2d2-segmentation/-/blob/main/3_2D-Seg-Audi-A2D2-Distributed-Training-DDP.ipynb
4. You can also launch those processes with the SageMaker MPI integration instead of torch.distributed. Unfortunately, we didn't create documentation for this, so no one uses it or pitches it. But it looks cool, because it lets you run copies of your script directly on the EC2 machines without invoking an intermediary PyTorch launcher. Example here.

So for now, my recommendation would be to go with route (3), which is the closest to what the PyTorch community does and so provides an easier development and debugging path.
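As a rough illustration of route (3), here is a minimal sketch of a Python launcher that could serve as the SageMaker entry point. It assumes the SM_HOSTS, SM_CURRENT_HOST and SM_NUM_GPUS environment variables that SageMaker Training sets in the container. The script name train.py, the port 29500 and the helper names build_launch_command and launch are my own placeholders, not something taken from the linked notebook.

```python
import json
import os
import subprocess
import sys


def build_launch_command(hosts, current_host, num_gpus, script,
                         script_args=(), master_port=29500):
    """Build the torch.distributed.launch command for the current node."""
    # Sorting gives every node the same ordering, hence the same master
    # and a unique node rank per machine.
    hosts = sorted(hosts)
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nnodes={len(hosts)}",
        f"--nproc_per_node={num_gpus}",
        f"--node_rank={hosts.index(current_host)}",
        f"--master_addr={hosts[0]}",
        f"--master_port={master_port}",  # assumed free port, same on all nodes
        script, *script_args,
    ]


def launch(script="train.py"):
    """Read the SageMaker environment and start one DDP process per GPU."""
    cmd = build_launch_command(
        hosts=json.loads(os.environ["SM_HOSTS"]),
        current_host=os.environ["SM_CURRENT_HOST"],
        num_gpus=int(os.environ["SM_NUM_GPUS"]),
        script=script,
        script_args=sys.argv[1:],  # forward hyperparameters to the training script
    )
    subprocess.run(cmd, check=True)
```

Calling launch() from the entry point script runs the launcher once per instance, which is what torch.distributed.launch expects. On a single p3.8xlarge this resolves to one node with four processes.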

Notes:

PyTorch DDP evolves fast. In PT 1.10, torch.distributed.launch is replaced by torchrun, and a torchX tool is being created to...simplify things.
Not having to manage that mess is one reason why SageMaker Distributed Data Parallel is a great value prop: you only need to edit your script, and the SM service handles process creation. Unfortunately, as you point out, SMDP being limited to P3 and P4 training jobs seriously limits its use.
Below are important PT DDP concepts to understand in order to turn single-GPU code into multi-machine code:
- Unlike Apache Spark, which takes care of workload partitioning on your behalf, PyTorch distributed training requires the user to assign specific pieces of work to specific GPUs. In the following points, we assume that we train on GPU.
- In PyTorch DDP, each GPU runs a customized copy of your training code. A copy of the training code running on one GPU is generally called a rank, a data parallel replica, a process or a worker, but other names may exist.
- For PyTorch DDP to launch a training cluster on the MxN GPUs spread over your M machines, you must tell PyTorch DDP the number of machines you have and the number of processes to launch per machine. This is done respectively with the --nnodes and --nproc_per_node parameters of the torch.distributed.launch utility. You must run torch.distributed.launch once on each node of the training cluster. You can run this command in parallel with multiple tools, for example with MPI or SageMaker Training as mentioned above. In order to establish the necessary handshakes and form a cluster, you must also specify in the torch.distributed.launch command --node_rank, which must take a unique machine ID between 0 and M-1 on each of the machines, and --master_addr and --master_port, which must be the same across all machines and are optional if you run a single-machine cluster.
- In the init_process_group DDP initialization method, called from within each data parallel replica script, you must specify the world size and replica ID, respectively with the world_size and rank parameters. Hence you must have a way to communicate a unique ID to each script, generally called the global rank. The global rank can help you personalize the work done by each GPU, for example saving a model from just one card, or running validation on only one card. In a cluster composed of 3 machines having 4 GPUs each, global ranks would range from 0 to 11. Within a machine, in order to assign DDP data parallel replicas to available GPUs, the script running in each replica must be assigned a GPU ID that is unique within the machine it's running on. This is called the local rank, and torch.distributed.launch can pass it to each script as an argument. In a cluster composed of 3 machines having 4 GPUs each, the DDP processes on each machine would have local ranks ranging from 0 to 3.
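To make the rank bookkeeping concrete, here is a small sketch that enumerates the ranks for the 3-machine, 4-GPU example above. It assumes the usual convention that a process's global rank is node_rank * nproc_per_node + local_rank. The helper name rank_layout is made up for illustration.

```python
def rank_layout(nnodes, nproc_per_node):
    """Return (node_rank, local_rank, global_rank) for every process in the cluster."""
    return [
        (node_rank, local_rank, node_rank * nproc_per_node + local_rank)
        for node_rank in range(nnodes)
        for local_rank in range(nproc_per_node)
    ]


# 3 machines with 4 GPUs each, as in the example above.
layout = rank_layout(nnodes=3, nproc_per_node=4)
world_size = len(layout)  # total number of DDP processes

for node_rank, local_rank, global_rank in layout:
    print(f"node {node_rank}  local_rank {local_rank}  global_rank {global_rank}")
```

Inside the training script, the local rank is what you would use to pick the GPU, while the global rank is what you would test (typically against 0) before saving a model or running validation.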