
Sagemaker - Distributed training

I can't find documentation on the behavior of Sagemaker when distributed training is not explicitly specified.

Specifically,

  1. When SageMaker distributed data parallel is used via distribution='dataparallel', the documentation states that each instance processes different batches of data.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    role=role,
    py_version="py37",
    framework_version="2.4.1",
    # For training with multinode distributed training, set this count. Example: 2
    instance_count=4,
    instance_type="ml.p3.16xlarge",
    sagemaker_session=sagemaker_session,
    # Training using SMDataParallel Distributed Training Framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
  2. I am not sure what happens when the distribution parameter is not specified but instance_count > 1, as below:
estimator = TensorFlow(
    py_version="py3",
    entry_point="mnist.py",
    role=role,
    framework_version="1.12.0",
    instance_count=4,
    instance_type="ml.m4.xlarge",
)

Thanks!

In the training code, if you initialize smdataparallel without having launched the job with the dataparallel distribution, you get a runtime error: RuntimeError: smdistributed.dataparallel cannot be used outside smddprun for distributed training launch.
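For context, a minimal sketch of the training-script side, assuming the Horovod-style smdistributed.dataparallel.tensorflow API (sdp.init(), sdp.rank(), sdp.local_rank(), sdp.size()); the init call only succeeds when the job was launched with the dataparallel distribution shown above:

import tensorflow as tf
import smdistributed.dataparallel.tensorflow as sdp

# Raises the RuntimeError quoted above unless the job was launched by smddprun,
# i.e. the estimator was created with
# distribution={"smdistributed": {"dataparallel": {"enabled": True}}}
sdp.init()

# Pin each worker process to its own GPU
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], "GPU")

print(f"worker {sdp.rank()} of {sdp.size()}")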

The distribution parameters you pass to the estimator select the appropriate runner.

"I am not sure what happens when distribution parameter is not specified but instance_count > 1 as below" -> SageMaker will run your code on 4 machines. “我不确定当未指定分布参数但 instance_count > 1 时会发生什么,如下所示” -> SageMaker 将在 4 台机器上运行您的代码。 Unless you have code purpose-built for distributed computation this is useless (simple duplication).除非您有专门为分布式计算构建的代码,否则这是无用的(简单复制)。

It gets really interesting when:

  • you parse the resource configuration (resourceconfig.json or via environment variables) so that each machine is aware of its rank in the cluster, and you can write arbitrary custom distributed logic (see the sketch after this list)
  • if you run the same code over input that is ShardedByS3Key, your code will run on different parts of your S3 data, spread homogeneously over the machines. This makes SageMaker Training/Estimators a great place to run arbitrary shared-nothing distributed tasks such as file transformations and batch inference.
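A rough sketch of these two points, assuming the SM_HOSTS / SM_CURRENT_HOST environment variables that SageMaker sets inside training containers and a hypothetical input channel named "train"; each instance works out its own rank and then only sees its shard of the data:

import json
import os

# SageMaker exposes the cluster layout via environment variables; the same
# information is in /opt/ml/input/config/resourceconfig.json
hosts = json.loads(os.environ["SM_HOSTS"])      # e.g. ["algo-1", "algo-2", "algo-3", "algo-4"]
current_host = os.environ["SM_CURRENT_HOST"]    # e.g. "algo-2"

rank = hosts.index(current_host)
world_size = len(hosts)
print(f"instance {rank} of {world_size}")

# With the channel configured as ShardedByS3Key, this directory holds only
# this instance's share of the S3 objects
train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
my_files = sorted(os.listdir(train_dir))

On the launching side, the sharding would be requested with something like sagemaker.inputs.TrainingInput(s3_data="s3://...", distribution="ShardedByS3Key") passed to estimator.fit().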

Having machines clustered together also allows you to launch open-source distributed training software like PyTorch DDP.
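As a minimal sketch of that idea, assuming the same SM_HOSTS / SM_CURRENT_HOST variables as above and one process per instance (a real multi-GPU setup would also need per-GPU local ranks):

import json
import os

import torch.distributed as dist

hosts = json.loads(os.environ["SM_HOSTS"])
current_host = os.environ["SM_CURRENT_HOST"]

# Use the first host as the rendezvous point for all processes
os.environ["MASTER_ADDR"] = hosts[0]
os.environ["MASTER_PORT"] = "29500"

dist.init_process_group(
    backend="gloo",                  # "nccl" on GPU instances
    rank=hosts.index(current_host),
    world_size=len(hosts),
)
# ... wrap the model in torch.nn.parallel.DistributedDataParallel and train ...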
