
Best way to run 1000s of training jobs on SageMaker

I have thousands of training jobs that I want to run on SageMaker. Basically, I have a list of hyperparameters and I want to train the model for all of those hyperparameters in parallel (not standard hyperparameter tuning, where we just want to optimize the hyperparameters; here we want to train for all of them). I have searched the docs quite extensively, but it surprises me that I couldn't find any info about this, even though it seems like pretty basic functionality.

For example, let's say I have 10,000 training jobs and my quota is 20 instances. What is the best way to run these jobs while utilizing all my available instances? In particular:

  • Is there a "queue manager" functionality that takes the list of hyperparameters and runs the training jobs in batches of 20 until they are all done (even better if it could keep track of failed/completed jobs).
  • Is it best practice to run a single training job per instance? If that's the case do I need to ask for a much higher quota on the number of instance?
  • If this functionality does not exist in sagemaker, is it worth using EC2 instead since it's a bit cheaper?

Your question is very broad and the best way forward depends on other details of your use case, so we will have to make some assumptions.

[Queue manager] SageMaker does not have a queue manager. If in the end you decide you need one, I would suggest looking at AWS Batch.
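If you'd rather roll your own than adopt AWS Batch, a minimal client-side "queue" sketch using boto3 could look like the following. The image URI, role, instance type, job naming, and the shape of your hyperparameter sets are all assumptions you would replace with your own values:

```python
# Minimal client-side queue sketch: keep at most MAX_PARALLEL SageMaker training
# jobs in flight and record which completed or failed. All <...> values are placeholders.
import time
import boto3

sm = boto3.client("sagemaker")

IMAGE_URI = "<your-training-image-uri>"        # assumption: bring-your-own training container
ROLE_ARN = "<your-sagemaker-execution-role>"   # assumption
OUTPUT_S3 = "s3://<your-bucket>/output/"       # assumption
MAX_PARALLEL = 20                              # your current instance quota

def submit(job_name, hyperparameters):
    """Launch one single-instance training job for one hyperparameter set."""
    sm.create_training_job(
        TrainingJobName=job_name,
        AlgorithmSpecification={"TrainingImage": IMAGE_URI, "TrainingInputMode": "File"},
        RoleArn=ROLE_ARN,
        OutputDataConfig={"S3OutputPath": OUTPUT_S3},
        ResourceConfig={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30},
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
        HyperParameters={k: str(v) for k, v in hyperparameters.items()},  # values must be strings
    )

def run_queue(hyperparameter_sets):
    """Top up the running pool to MAX_PARALLEL, poll statuses, track completed/failed."""
    pending = list(enumerate(hyperparameter_sets))
    running, completed, failed = {}, [], []
    while pending or running:
        while pending and len(running) < MAX_PARALLEL:
            idx, hps = pending.pop(0)
            name = f"train-{idx}"
            submit(name, hps)
            running[name] = idx
        for name in list(running):
            status = sm.describe_training_job(TrainingJobName=name)["TrainingJobStatus"]
            if status == "Completed":
                completed.append(name); running.pop(name)
            elif status in ("Failed", "Stopped"):
                failed.append(name); running.pop(name)
        time.sleep(60)
    return completed, failed
```

Note that launching one job per hyperparameter set this way still pays the per-job startup overhead discussed below, which is why reusing instances is usually preferable for lightweight models.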

[Single vs multiple training jobs] Since you need to run tens of thousands of jobs, I assume you are training fairly lightweight models, so to save time you would be better off reusing instances for multiple training jobs. (Otherwise, with a 20-instance limit, you need 500 rounds of training; with a roughly 3-minute startup time per round, depending on the instance type, that is about 25 hours spent just waiting. Depending on the complexity of each individual model, those 25 hours might be significant or perfectly acceptable.)

[Instance limit increase] You can always ask for a limit increase, but a jump from a limit of 20 to 10k at once is unlikely to be accepted by the AWS support team, unless you are part of an organisation with a track record of usage on AWS, in which case it might be fine.

[One possible option] (Assuming multiple lightweight models) You could create a single training job with the instance count set to the number of instances available to you. Inside the training job, your code can run a for loop and perform all the individual trainings you need.
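As a rough sketch of what launching that single multi-instance job could look like with the SageMaker Python SDK (the image URI, role, bucket paths, and instance type are placeholders, and in practice you would also pass your input channels to fit):

```python
# Sketch: one training job spread across all available instances.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<your-training-image-uri>",                 # assumption: your own training container
    role="<your-sagemaker-execution-role>",                # assumption
    instance_count=20,                                     # use your full instance quota in one job
    instance_type="ml.m5.xlarge",
    output_path="s3://<your-bucket>/output/",
    checkpoint_s3_uri="s3://<your-bucket>/checkpoints/",   # synced from /opt/ml/checkpoints (see below)
)

# fit() would normally also receive your training data channels.
estimator.fit(wait=False)
```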

In this case, you will need to know which instance is which so you can split the hyperparameter sets across them. SageMaker writes this information to the file /opt/ml/input/config/resourceconfig.json, so using that you can easily have each instance run a subset of the required trainings.
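Inside the training container, the split could look roughly like this (hyperparameter_sets and train_one_model are placeholders for however you load your hyperparameter list and run one training):

```python
# Sketch: each instance trains only its own slice of the hyperparameter list,
# based on SageMaker's resourceconfig.json.
import json

with open("/opt/ml/input/config/resourceconfig.json") as f:
    resource_config = json.load(f)

hosts = sorted(resource_config["hosts"])                  # e.g. ["algo-1", "algo-2", ...]
my_rank = hosts.index(resource_config["current_host"])    # this instance's position

# Each instance handles every N-th hyperparameter set, where N = number of instances.
for i, hps in enumerate(hyperparameter_sets):
    if i % len(hosts) == my_rank:
        train_one_model(hps)
```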

Another thing to think about is whether you need to save the generated models (which you probably do). You can save everything in the output model directory (the standard SageMaker approach), but this zips all the models into a single model.tar.gz file. If you don't want that and prefer to have each model saved individually, I'd suggest using the checkpoints directory, which syncs anything written there to your S3 location.
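A small sketch of that, writing one file per trained model to the local checkpoint directory that SageMaker syncs to the checkpoint_s3_uri configured on the job (save_model is a placeholder for your framework's serialization, e.g. joblib.dump or torch.save):

```python
# Sketch: persist each model as its own file under /opt/ml/checkpoints,
# which SageMaker syncs to S3, instead of bundling everything into model.tar.gz.
import os

CHECKPOINT_DIR = "/opt/ml/checkpoints"
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def persist(model, job_index):
    path = os.path.join(CHECKPOINT_DIR, f"model-{job_index}.bin")
    save_model(model, path)  # placeholder for your serialization call
```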
