SageMaker - Distributed training
I can't find documentation on the behavior of SageMaker when distributed training is not explicitly specified. Specifically:
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(
role=role,
py_version="py37",
framework_version="2.4.1",
# For training with multinode distributed training, set this count. Example: 2
instance_count=4,
instance_type="ml.p3.16xlarge",
sagemaker_session=sagemaker_session,
# Training using SMDataParallel Distributed Training Framework
distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator = TensorFlow(
py_version="py3",
entry_point="mnist.py",
role=role,
framework_version="1.12.0",
instance_count=4,
instance_type="ml.m4.xlarge",
)
Thanks!
In the training code, if you initialize smdataparallel without the launcher you get a runtime error: RuntimeError: smdistributed.dataparallel cannot be used outside smddprun for distributed training launch.
The distribution parameter you pass to the estimator selects the appropriate runner.
"I am not sure what happens when the distribution parameter is not specified but instance_count > 1 as below" -> SageMaker will run your code on 4 machines. Unless you have code purpose-built for distributed computation, this is useless (simple duplication).
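To see the duplication concretely: each of the 4 containers receives the same code and the same cluster description, and a script can inspect the SM_HOSTS and SM_CURRENT_HOST environment variables that SageMaker sets inside the training container to discover which node it is. A minimal sketch, with hypothetical host values simulated locally:

```python
import json
import os


def describe_node():
    """Return (current_host, all_hosts, node_index) from SageMaker's env vars.

    SageMaker sets SM_HOSTS to a JSON list of all hosts in the training
    cluster and SM_CURRENT_HOST to this container's own hostname.
    The defaults below cover a single-instance job.
    """
    hosts = json.loads(os.environ.get("SM_HOSTS", '["algo-1"]'))
    current = os.environ.get("SM_CURRENT_HOST", "algo-1")
    return current, hosts, hosts.index(current)


# Simulate what a 4-instance job would see (hypothetical values):
os.environ["SM_HOSTS"] = json.dumps(["algo-1", "algo-2", "algo-3", "algo-4"])
os.environ["SM_CURRENT_HOST"] = "algo-3"

current, hosts, node_index = describe_node()
print(current, len(hosts), node_index)  # algo-3 4 2
```

Without a distribution framework wired in, nothing in this script differs per node except these variables, so all 4 machines do identical work unless your code branches on them.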
It gets really interesting when you run the same code over input sharded with ShardedByS3Key: your code will run on different parts of your S3 data, spread homogeneously over the machines. This makes SageMaker Training/Estimators a great place to run arbitrary shared-nothing distributed tasks such as file transformations and batch inference. Having machines clustered together also allows you to launch open-source distributed training software like PyTorch DDP.
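The ShardedByS3Key semantics can be illustrated without the SDK: each S3 object under the input prefix is assigned to exactly one instance, so each node sees only its own subset. The round-robin rule below is an illustrative assumption, not SageMaker's documented assignment algorithm:

```python
def shard_by_s3_key(keys, num_instances):
    """Illustrative round-robin split of S3 keys across training instances.

    With ShardedByS3Key each instance downloads only its own shard, so
    per-node code can process its files independently (shared-nothing),
    e.g. for file transformations or batch inference.
    """
    shards = [[] for _ in range(num_instances)]
    for i, key in enumerate(sorted(keys)):
        shards[i % num_instances].append(key)
    return shards


keys = [f"data/part-{i:04d}.csv" for i in range(10)]
for host, shard in zip(["algo-1", "algo-2", "algo-3", "algo-4"],
                       shard_by_s3_key(keys, 4)):
    print(host, shard)
```

In the SDK you request this behavior by passing the channel as sagemaker.inputs.TrainingInput(s3_data=..., distribution="ShardedByS3Key") instead of the default FullyReplicated.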