简体   繁体   中英

How to prepare docker image for SageMaker training job

Let's say I have docker image which has the training code for Machine Learning model. How can I adapt it for SageMaker Training Job so I can run the docker image there?

There are several things to keep in mind when adapting docker image for SageMaker Training Job:

  1. Training code

Training code should be located in /opt/ml/code/ in the docker image and the main script should be /opt/ml/code/train . Also, the script should have permissions for executing it ( chmod 777 /opt/ml/code/train does the trick). Also not critical, but useful - if you need to do any imports you may need to add code to the path export PATH="/opt/ml/code:${PATH}" . By default SageMaker runs for training docker run image train , but alternatively you can use a custom entrypoint so that you can keep the original path for the training code.

  1. Hyperparameters

Hyperparameters which are provided as part of HyperParameters parameters are saved as json file in /opt/ml/input/config/hyperparameters.json so your training code has to read them from there. Keep in mind that it only supports string fields and has the following limits:

  • Map Entries: Minimum number of 0 items. Maximum number of 100 items.
  • Key Length Constraints: Maximum length of 256.
  • Value Length Constraints: Maximum length of 2500.
  1. Output

To save the model, it should be put all files into /opt/ml/model/ folder - SageMaker compresses all the files here into the TAR format and saves it to S3 in the location specified in the training job.

  1. [Optional] Metrics

Metrics are gathered by running regular expressions on the logs produced by the container. These are defined in metrics definition of the training job where you can provide regular expressions for existing logs (eg loss after each epoch with val_loss: (.*) ). The minimal thing to enable metric-friendly log is the following - Log f"Value = {value}" , corresponding regex Value = (.*) , corresponding metric definition {"Name": "value", "Regex": "Value = (.*)"} .

Important documentation:

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM