How to prepare a Docker image for a SageMaker training job

Original post: 2022-11-15 12:13:41 | amazon-web-services / machine-learning / amazon-sagemaker

Let's say I have a docker image that contains the training code for a machine learning model. How can I adapt it for a SageMaker Training Job so that I can run the image there?

1 answer

There are several things to keep in mind when adapting a docker image for a SageMaker Training Job:

Training code

Training code should be located in /opt/ml/code/ in the docker image, and the main script should be /opt/ml/code/train . The script must be executable: chmod +x /opt/ml/code/train is enough ( chmod 777 /opt/ml/code/train also does the trick, but it grants write access to everyone, which is not needed). Not critical, but useful: add the code directory to the path with export PATH="/opt/ml/code:${PATH}" so that train can be found by name. If you need to import Python modules from that directory, the variable that matters is PYTHONPATH. By default SageMaker starts training with docker run image train , but you can instead use a custom entrypoint, which lets you keep the original path of your training code.
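If the image is assembled or checked from a Python script, the permission step can be sketched as below. This is only an illustration of what chmod a+x does; the helper name make_executable is hypothetical, and in a Dockerfile a plain RUN chmod +x is the usual route.

```python
import os
import stat


def make_executable(path: str) -> int:
    """Add the execute bit for user, group and others (roughly `chmod a+x`).

    Returns the resulting permission bits. Unlike `chmod 777`, existing
    read/write bits are left untouched.
    """
    current = stat.S_IMODE(os.stat(path).st_mode)
    updated = current | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    os.chmod(path, updated)
    return updated
```

Calling make_executable("/opt/ml/code/train") on a file with mode 644 leaves it at 755.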

Hyperparameters

Hyperparameters passed in the HyperParameters parameter of the training job are saved as a JSON file at /opt/ml/input/config/hyperparameters.json , so your training code has to read them from there. Keep in mind that keys and values are always strings (numbers and booleans have to be parsed by your code) and that the following limits apply:

Map Entries: Minimum number of 0 items. Maximum number of 100 items.
Key Length Constraints: Maximum length of 256.
Value Length Constraints: Maximum length of 2500.
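A minimal sketch of how the training script might read that file and check the limits above. The hyperparameter names epochs and learning_rate, their defaults, and the helper names are assumptions made for illustration; only the file path and the limits come from the text.

```python
import json
from pathlib import Path

HYPERPARAMETERS_PATH = "/opt/ml/input/config/hyperparameters.json"


def load_hyperparameters(path: str = HYPERPARAMETERS_PATH) -> dict:
    """Return the raw string-to-string hyperparameter map, or {} if the file is absent."""
    file = Path(path)
    if not file.exists():
        return {}
    with file.open() as f:
        return json.load(f)


def check_limits(hyperparameters: dict) -> None:
    """Raise ValueError if the map breaks the limits listed above."""
    if len(hyperparameters) > 100:
        raise ValueError("at most 100 hyperparameters are allowed")
    for key, value in hyperparameters.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise ValueError(f"{key!r}: keys and values must be strings")
        if len(key) > 256:
            raise ValueError(f"key {key[:20]!r}... is longer than 256 characters")
        if len(value) > 2500:
            raise ValueError(f"value for {key!r} is longer than 2500 characters")


def parse_hyperparameters(raw: dict) -> dict:
    """Convert string values to the types the training code needs.

    'epochs' and 'learning_rate' are hypothetical names with hypothetical defaults.
    """
    return {
        "epochs": int(raw.get("epochs", "1")),
        "learning_rate": float(raw.get("learning_rate", "0.001")),
    }
```

check_limits is meant for the submitting side (before the job is created), while load_hyperparameters and parse_hyperparameters belong in the train script itself.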

Output

To save the model, write all of its files into the /opt/ml/model/ folder. When training finishes, SageMaker packs everything in that folder into a compressed TAR archive (model.tar.gz) and saves it to S3 in the output location specified in the training job.
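As a rough sketch, the end of a training script could write its artifacts like this. The JSON file name model.json and the weights dictionary are placeholders; a real model would be saved with whatever serializer its framework provides, into the same folder.

```python
import json
import os

MODEL_DIR = "/opt/ml/model"


def save_model(weights: dict, model_dir: str = MODEL_DIR,
               filename: str = "model.json") -> str:
    """Write a (hypothetical) JSON-serializable model into the model folder.

    Everything placed in model_dir is what SageMaker archives and uploads
    after the container exits.
    """
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, filename)
    with open(path, "w") as f:
        json.dump(weights, f)
    return path
```

Passing a different model_dir makes the same function usable in local runs outside SageMaker.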

[Optional] Metrics

Metrics are gathered by running regular expressions on the logs produced by the container. The expressions are set in the metric definitions of the training job, where you can provide a regular expression that matches lines your code already logs (e.g. the loss after each epoch with val_loss: (.*) ). A minimal metric-friendly setup is the following: log f"Value = {value}" , use the regex Value = (.*) , and add the metric definition {"Name": "value", "Regex": "Value = (.*)"} .
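Before submitting a job it can help to check locally that a log line and its regex agree. The sketch below uses the metric definition from the text; the helpers format_metric and extract_metrics are hypothetical, and re.search only approximates how SageMaker applies the expression to the log stream.

```python
import re

# Metric definition exactly as given in the text above.
METRIC_DEFINITIONS = [{"Name": "value", "Regex": "Value = (.*)"}]


def format_metric(value: float) -> str:
    """Build the log line that the training code would print."""
    return f"Value = {value}"


def extract_metrics(log_line: str, definitions=METRIC_DEFINITIONS) -> dict:
    """Apply each metric regex to one log line and collect the first capture group."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = match.group(1)
    return found
```

In the training loop the line would be emitted with print(format_metric(value), flush=True) so that it reaches the container logs promptly. Because (.*) captures everything up to the end of the line, a stricter group such as ([0-9.]+) may be preferable when other text follows the number.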