Use PyTorch DistributedDataParallel with Hugging Face on Amazon SageMaker

Original post: 2022-09-08 09:03:22 · Tags: pytorch, amazon-sagemaker, huggingface-transformers

Even for single-instance training, PyTorch DistributedDataParallel (DDP) is generally recommended over PyTorch DataParallel (DP), because DP's strategy is less performant and it uses more memory on the default device. (Per this PyTorch forums thread)

Hugging Face recommends running distributed training via the python -m torch.distributed.launch launcher, because their Trainer API supports DDP but falls back to DP if you don't use it. (Per this HF forums thread)

I recently ran into this problem: scaling a HF training job from p3.8xlarge to p3.16xlarge increased memory consumption on (I think) one of the GPUs to the point where I had to significantly reduce batch size to avoid CUDA Out of Memory errors, basically losing all the scaling advantage.

So the good news is that for p3.16xl+ I can just enable SageMaker Distributed Data Parallel, and the PyTorch DLC will automatically launch via torch.distributed for me.

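The bad news, for use cases with smaller workloads or for anyone wanting to test before scaling up, is that SMDistributed doesn't support all multi-GPU instance types. No p3.8xl or g series, for example. I did try manually setting the sagemaker_distributed_dataparallel_enabled environment variable, but no joy.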
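For context, SageMaker Distributed Data Parallel is normally switched on through the estimator's distribution argument in the SageMaker Python SDK rather than through an environment variable. A minimal sketch of that setting follows. The helper name smddp_distribution and the hard-coded instance list are my own assumptions for illustration; the set of supported instance types changes over time, so check the current SageMaker documentation.

```python
from typing import Optional

# Assumption: instance types accepted by SMDDP around the time of the post.
# This set is illustrative only and may be out of date.
SMDDP_INSTANCE_TYPES = {"ml.p3.16xlarge", "ml.p3dn.24xlarge", "ml.p4d.24xlarge"}


def smddp_distribution(instance_type: str) -> Optional[dict]:
    """Return a value for the estimator's `distribution` argument.

    Hypothetical helper: enables SMDDP only where it is assumed to be
    supported, and returns None otherwise (no managed launcher).
    """
    if instance_type in SMDDP_INSTANCE_TYPES:
        return {"smdistributed": {"dataparallel": {"enabled": True}}}
    return None


# The result would be passed along the lines of
# HuggingFace(..., distribution=smddp_distribution(instance_type)).
print(smddp_distribution("ml.p3.16xlarge"))
print(smddp_distribution("ml.p3.8xlarge"))
```

On an unsupported type such as ml.p3.8xlarge the helper returns None, which is exactly the case the question is about.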

So how else can we launch HF Trainer scripts with PyTorch DDP on SageMaker?

1 Answer

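Great question, thanks for asking! PyTorch DDP runs data parallel workers in multiple processes, which must be launched and managed by the developer. DDP should be seen as a managed allreduce rather than a managed data-parallelism library, because it requires you to launch and manage the workers, and even to assign resources to them. To launch the DDP processes in a SageMaker Training job, you have several options: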

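1. If you do multi-GPU, single-machine training, you can use torch.multiprocessing.spawn, as shown in this official PyTorch demo.
2. If you do multi-GPU, single-machine training, you can also use the Ray Train library to launch those processes. I was able to use it in a notebook, but not yet in the DLC (the library is recent and a bit rough to learn and get working; see all my issues here). Ray Train should work on multi-node too.
3. If you do multi-GPU training on any number of machines, you can use torch.distributed.launch, wrapped in a launcher script in shell or Python. Example here: https://gitlab.aws.dev/cruchant/a2d2-segmentation/-/blob/main/3_2D-Seg-Audi-A2D2-Distributed-Training-DDP.ipynb
4. You can also launch those processes with the SageMaker MPI integration instead of torch.distributed. Unfortunately, we didn't create documentation for this, so nobody uses it or pitches it. But it looks cool, because it runs copies of your script directly on the EC2 machines without invoking an intermediary PyTorch launcher. Example here.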

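So for now, my recommendation would be to go with route (3), which is the closest to what the PyTorch community does and therefore gives an easier development and debugging path.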
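A launcher for route (3) could look roughly like the sketch below. It is not the code from the linked notebook. It assumes the SageMaker training container exposes the SM_HOSTS, SM_CURRENT_HOST and SM_NUM_GPUS environment variables, and it uses train.py as a placeholder name for your HF Trainer script. The function name build_launch_command and the default port are also my own choices.

```python
import json
import os
import subprocess
import sys


def build_launch_command(hosts, current_host, gpus_per_host, script,
                         script_args=(), master_port=29500):
    """Build the torch.distributed.launch command for the current node.

    The first host in sorted order acts as master; node_rank is the index
    of the current host in that same sorted list, so every node agrees.
    """
    hosts = sorted(hosts)
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nnodes={len(hosts)}",
        f"--nproc_per_node={gpus_per_host}",
        f"--node_rank={hosts.index(current_host)}",
        f"--master_addr={hosts[0]}",
        f"--master_port={master_port}",
        script, *script_args,
    ]


def main():
    # Assumed to be set by the SageMaker training container.
    hosts = json.loads(os.environ["SM_HOSTS"])
    current_host = os.environ["SM_CURRENT_HOST"]
    num_gpus = int(os.environ["SM_NUM_GPUS"])
    # "train.py" is a placeholder for the actual training script;
    # hyperparameters received by the launcher are forwarded unchanged.
    cmd = build_launch_command(hosts, current_host, num_gpus,
                               "train.py", sys.argv[1:])
    subprocess.run(cmd, check=True)


# Only launch when running inside a SageMaker training container.
if __name__ == "__main__" and "SM_HOSTS" in os.environ:
    main()
```

You would set this launcher as the estimator's entry point, so that SageMaker runs it once per instance and it in turn starts one process per GPU. On a single p3.8xlarge this yields --nnodes=1 and --nproc_per_node=4.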

Notes:

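PyTorch DDP is evolving fast. In PT 1.10, torch.distributed.launch is being superseded by torchrun, and a torchX tool is being created to... simplify things!
Not having to manage that mess is one reason why SageMaker Distributed Data Parallel is great value: you only need to edit your script, and the SM service handles process creation. Unfortunately, as you point out, SMDP is limited to P3 and P4 training jobs, which seriously limits its use.
Below are the important PT DDP concepts to understand when turning single-GPU code into multi-machine code:
- Unlike Apache Spark, which handles workload partitioning on your behalf, PyTorch distributed training requires the user to assign specific work to specific GPUs. In what follows, we assume we are training on GPUs.
- In PyTorch DDP, each GPU runs a customized copy of your training code. A copy of the training code running on one GPU is usually called a rank, a data parallel replica, a process or a worker, but other names may exist.
- For PyTorch DDP to launch a training cluster over MxN GPUs spread across M machines, you must tell PyTorch DDP the number of machines you have and the number of processes to launch per machine. This is done with the --nnodes and --nproc_per_node parameters of the torch.distributed.launch utility, respectively. You must run torch.distributed.launch once on each node of the training cluster. You can use several tools to run this command in parallel, for example MPI or SageMaker Training, as mentioned above. To establish the necessary handshakes and form a cluster, you must also specify --node_rank in the torch.distributed.launch command, which must be a unique machine ID between 0 and M-1 on each machine, as well as --master_addr and --master_port, which are optional for a single-machine cluster and must be the same across all machines.
- In the init_process_group DDP initialization method, run from within each data parallel replica script, you must specify the world size and the replica ID with the world_size and rank parameters, respectively. You therefore need a way to communicate a unique ID to each script, usually called the global rank. The global rank lets you personalize the work done by each GPU, for example saving the model from one card only, or running validation on one card only. In a cluster of 3 machines with 4 GPUs each, global ranks range from 0 to 11. Within a machine, to assign DDP data parallel replicas to the available GPUs, the script running in each replica must be given a GPU ID that is unique within the machine it runs on. This is called the local rank, and torch.distributed.launch can pass it to the script as an argument. In a cluster of 3 machines with 4 GPUs each, the DDP processes on each machine have local ranks ranging from 0 to 3.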
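The rank bookkeeping described above can be sketched in a few lines of plain Python. The helper names global_rank and resolve_local_rank are illustrative, not part of PyTorch. The sketch assumes the launcher numbers processes node by node, and that the local rank arrives either as a LOCAL_RANK environment variable (as torchrun provides) or as a --local_rank / --local-rank command-line argument (as torch.distributed.launch passes, depending on version).

```python
import argparse


def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank of a process, assuming ranks are numbered node by node."""
    return node_rank * nproc_per_node + local_rank


def resolve_local_rank(argv, environ) -> int:
    """Read the local rank from the environment or from the CLI arguments."""
    if "LOCAL_RANK" in environ:
        return int(environ["LOCAL_RANK"])
    parser = argparse.ArgumentParser()
    # Accept both spellings, since the flag name differs across versions.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=0)
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


# Cluster from the text: M = 3 machines with N = 4 GPUs each.
M, N = 3, 4
ranks = [[global_rank(m, N, l) for l in range(N)] for m in range(M)]
for node, node_ranks in enumerate(ranks):
    print(f"node {node}: local ranks {list(range(N))} -> global ranks {node_ranks}")

# In the training script, these values would then feed something like
# torch.distributed.init_process_group(backend="nccl", rank=<global rank>,
#                                      world_size=M * N)
# and torch.cuda.set_device(<local rank>).
```

In a real script you would call resolve_local_rank(sys.argv[1:], os.environ). A check such as "global rank equals 0" is the usual way to restrict model saving or validation to a single card.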


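Notice: technical posts on this site follow the CC BY-SA 4.0 license. If you reproduce this content, please credit this site's URL or the original source address. For any questions, contact: yoyou2525@163.com.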
