How to prepare docker image for SageMaker training job

Original 2022-11-15 12:13:41 6 1 amazon-web-services/ machine-learning/ amazon-sagemaker

Let's say I have a Docker image that contains the training code for a machine learning model. How can I adapt it for a SageMaker Training Job so that I can run the image there?

1 answer

There are several things to keep in mind when adapting a Docker image for a SageMaker Training Job:

Training code

The training code should live in /opt/ml/code/ inside the Docker image, and the main script should be /opt/ml/code/train . The script must be executable ( chmod +x /opt/ml/code/train is enough; chmod 777 also works but grants more permissions than are needed). Not critical, but useful: add the code directory to the path with export PATH="/opt/ml/code:${PATH}" so that train can be found by name. Note that PATH only affects how executables are located; if the goal is to import Python modules from that directory, PYTHONPATH is the relevant variable. By default SageMaker starts training with docker run image train , but you can instead define a custom entrypoint and keep the original location of the training code.
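The two requirements above (the file exists and is executable) can be checked before the image is pushed. The following is only a sketch: the helper name check_train_script is hypothetical, and the shebang check assumes that train is a script rather than a compiled binary.

```python
from pathlib import Path


def check_train_script(path="/opt/ml/code/train"):
    """Return a list of problems that would prevent running the script directly.

    Hypothetical pre-flight helper, not part of SageMaker.
    """
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    # Any of the user/group/other execute bits counts as "executable" here.
    if not path.stat().st_mode & 0o111:
        problems.append(f"{path} has no execute bit set (try chmod +x)")
    # A script that is executed directly needs a shebang line such as
    # "#!/usr/bin/env python3" as its first line.
    with path.open("rb") as f:
        if f.read(2) != b"#!":
            problems.append(f"{path} does not start with a shebang line")
    return problems
```

Running it during the image build (for example from a small test step) and failing when the returned list is not empty catches the most common reason for a job that dies immediately at startup.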

Hyperparameters

Hyperparameters passed in the HyperParameters parameter of the training job are saved as a JSON file at /opt/ml/input/config/hyperparameters.json , so your training code has to read them from there. Keep in mind that only string values are supported, with the following limits:

Map Entries: Minimum number of 0 items. Maximum number of 100 items.
Key Length Constraints: Maximum length of 256.
Value Length Constraints: Maximum length of 2500.
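Because every value arrives as a string, the training code usually has to cast them itself. A minimal sketch of reading the file is shown here; the hyperparameter names epochs and learning_rate and their defaults are hypothetical and only illustrate the casting.

```python
import json
from pathlib import Path

# Location where SageMaker writes the hyperparameters inside the container.
HYPERPARAMETERS_PATH = Path("/opt/ml/input/config/hyperparameters.json")


def load_hyperparameters(path=HYPERPARAMETERS_PATH):
    """Return the raw hyperparameters as a dict ({} if the file is missing)."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open() as f:
        return json.load(f)


def parse_hyperparameters(raw):
    """Cast the string values to the types the training code expects.

    "epochs" and "learning_rate" are hypothetical names used for illustration.
    """
    return {
        "epochs": int(raw.get("epochs", "10")),
        "learning_rate": float(raw.get("learning_rate", "0.001")),
    }
```

Returning an empty dict when the file is missing makes it possible to run the same script locally, outside SageMaker, with the defaults.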

Output

To save the model, write all of its files to the /opt/ml/model/ folder. SageMaker compresses everything in this folder into a tar archive (model.tar.gz) and uploads it to the S3 location specified in the training job.
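As a rough illustration, the end of a training script could write its artifacts as follows. The file name model.json and the weights dictionary are hypothetical stand-ins for whatever the real framework saves.

```python
import json
from pathlib import Path

# Folder that SageMaker archives and uploads when the job finishes.
MODEL_DIR = Path("/opt/ml/model")


def save_model(weights, model_dir=MODEL_DIR):
    """Write the model artifacts into model_dir and return the file path.

    `weights` is a hypothetical JSON-serializable stand-in for a real model.
    """
    model_dir = Path(model_dir)
    # The folder may not exist when the script is run outside SageMaker.
    model_dir.mkdir(parents=True, exist_ok=True)
    out_path = model_dir / "model.json"
    with out_path.open("w") as f:
        json.dump(weights, f)
    return out_path
```

Keeping the target folder as a parameter makes the function easy to exercise locally with a temporary directory.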

[Optional] Metrics

Metrics are collected by running regular expressions against the logs produced by the container. The expressions are set in the metric definitions of the training job, where you can provide a regex that matches your existing log lines (e.g. the loss after each epoch with val_loss: (.*) ). A minimal metric-friendly setup looks like this: log f"Value = {value}" , match it with the regex Value = (.*) , and register the metric definition {"Name": "value", "Regex": "Value = (.*)"} .
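It can help to verify locally that a log line and its regex agree before starting a job. The sketch below reuses the metric definition from the paragraph above; the function names are hypothetical, and extract_metric only imitates the matching locally (SageMaker performs the real extraction on the job logs).

```python
import re

# Same definition as the one passed to the training job.
METRIC_DEFINITION = {"Name": "value", "Regex": "Value = (.*)"}


def log_metric(value):
    """Print the metric in the format the regex expects and return the line."""
    line = f"Value = {value}"
    # flush=True so the line reaches the container logs without buffering delay.
    print(line, flush=True)
    return line


def extract_metric(line, definition=METRIC_DEFINITION):
    """Local check: return the captured text, or None if the line does not match."""
    match = re.search(definition["Regex"], line)
    return match.group(1) if match else None
```

Since (.*) captures everything up to the end of the line, it is safest to print nothing else after the value on the same line.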

Important documentation:

Container structure - https://docs.aws.amazon.com/sagemaker/latest/dg/amazon-sagemaker-toolkits.html
How the container is started - https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html
Output - https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html
