
Best way to run 1000s of training jobs on SageMaker

I have thousands of training jobs that I want to run on SageMaker. Basically, I have a list of hyperparameters and I want to train the model for all of those hyperparameters in parallel (not standard hyperparameter tuning, where we just want to optimize the hyperparameters; here we want to train for all of them). I have searched the docs quite extensively, but it surprises me that I couldn't find any info about this, even though it seems like pretty basic functionality.

For example, let's say I have 10,000 training jobs and my quota is 20 instances. What is the best way to run these jobs while utilizing all my available instances? In particular:

  • Is there a "queue manager" functionality that takes the list of hyperparameters and runs the training jobs in batches of 20 until they are all done (even better if it could keep track of failed/completed jobs).
  • Is it best practice to run a single training job per instance? If that's the case do I need to ask for a much higher quota on the number of instance?
  • If this functionality does not exist in sagemaker, is it worth using EC2 instead since it's a bit cheaper?

Your question is very broad and the best way forward depends on other details of your use case, so we will have to make some assumptions.

[Queue manager] SageMaker does not have a queue manager. If in the end you decide you need one, I would suggest looking at AWS Batch.
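If you'd rather roll your own than adopt AWS Batch, a minimal client-side "queue" sketch using boto3 could look like the following. The image URI, role, instance type, job naming, and the shape of your hyperparameter sets are all assumptions you would replace with your own values:

```python
# Minimal client-side queue sketch: keep at most MAX_PARALLEL SageMaker training
# jobs in flight and record which completed or failed. All <...> values are placeholders.
import time
import boto3

sm = boto3.client("sagemaker")

IMAGE_URI = "<your-training-image-uri>"        # assumption: bring-your-own training container
ROLE_ARN = "<your-sagemaker-execution-role>"   # assumption
OUTPUT_S3 = "s3://<your-bucket>/output/"       # assumption
MAX_PARALLEL = 20                              # your current instance quota

def submit(job_name, hyperparameters):
    """Launch one single-instance training job for one hyperparameter set."""
    sm.create_training_job(
        TrainingJobName=job_name,
        AlgorithmSpecification={"TrainingImage": IMAGE_URI, "TrainingInputMode": "File"},
        RoleArn=ROLE_ARN,
        OutputDataConfig={"S3OutputPath": OUTPUT_S3},
        ResourceConfig={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30},
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
        HyperParameters={k: str(v) for k, v in hyperparameters.items()},  # values must be strings
    )

def run_queue(hyperparameter_sets):
    """Top up the running pool to MAX_PARALLEL, poll statuses, track completed/failed."""
    pending = list(enumerate(hyperparameter_sets))
    running, completed, failed = {}, [], []
    while pending or running:
        while pending and len(running) < MAX_PARALLEL:
            idx, hps = pending.pop(0)
            name = f"train-{idx}"
            submit(name, hps)
            running[name] = idx
        for name in list(running):
            status = sm.describe_training_job(TrainingJobName=name)["TrainingJobStatus"]
            if status == "Completed":
                completed.append(name); running.pop(name)
            elif status in ("Failed", "Stopped"):
                failed.append(name); running.pop(name)
        time.sleep(60)
    return completed, failed
```

Note that launching one job per hyperparameter set this way still pays the per-job startup overhead discussed below, which is why reusing instances is usually preferable for lightweight models.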

[Single vs multiple training jobs] Since you need to run tens of thousands of jobs, I assume you are training fairly lightweight models, so to save time you would be better off reusing instances for multiple training jobs. (Otherwise, with a 20-instance limit, you need 500 rounds of training; with a roughly 3-minute startup time per round, depending on the instance type, that is about 25 hours spent just waiting. Depending on the complexity of each individual model, those 25 hours might be significant or perfectly acceptable.)

[Instance limit increase] You can always ask for a limit increase, but a jump from a limit of 20 to 10k at once is unlikely to be accepted by the AWS support team, unless you are part of an organisation with a track record of usage on AWS, in which case it might be fine.

[One possible option] (Assuming multiple lightweight models) You could create a single training job with the instance count set to the number of instances available to you. Inside the training job, your code can run a for loop and perform all the individual trainings you need.
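As a rough sketch of what launching that single multi-instance job could look like with the SageMaker Python SDK (the image URI, role, bucket paths, and instance type are placeholders, and in practice you would also pass your input channels to fit):

```python
# Sketch: one training job spread across all available instances.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<your-training-image-uri>",                 # assumption: your own training container
    role="<your-sagemaker-execution-role>",                # assumption
    instance_count=20,                                     # use your full instance quota in one job
    instance_type="ml.m5.xlarge",
    output_path="s3://<your-bucket>/output/",
    checkpoint_s3_uri="s3://<your-bucket>/checkpoints/",   # synced from /opt/ml/checkpoints (see below)
)

# fit() would normally also receive your training data channels.
estimator.fit(wait=False)
```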

In this case, you will need to know which instance is which so you can split the hyperparameter sets across them. SageMaker writes this information to the file /opt/ml/input/config/resourceconfig.json, so using that you can easily have each instance run a subset of the required trainings.
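Inside the training container, the split could look roughly like this (hyperparameter_sets and train_one_model are placeholders for however you load your hyperparameter list and run one training):

```python
# Sketch: each instance trains only its own slice of the hyperparameter list,
# based on SageMaker's resourceconfig.json.
import json

with open("/opt/ml/input/config/resourceconfig.json") as f:
    resource_config = json.load(f)

hosts = sorted(resource_config["hosts"])                  # e.g. ["algo-1", "algo-2", ...]
my_rank = hosts.index(resource_config["current_host"])    # this instance's position

# Each instance handles every N-th hyperparameter set, where N = number of instances.
for i, hps in enumerate(hyperparameter_sets):
    if i % len(hosts) == my_rank:
        train_one_model(hps)
```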

Another thing to think about is whether you need to save the generated models (which you probably do). You can save everything in the output model directory (the standard SageMaker approach), but this zips all the models into a single model.tar.gz file. If you don't want that and prefer to have each model saved individually, I'd suggest using the checkpoints directory, which syncs anything written there to your S3 location.
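A small sketch of that, writing one file per trained model to the local checkpoint directory that SageMaker syncs to the checkpoint_s3_uri configured on the job (save_model is a placeholder for your framework's serialization, e.g. joblib.dump or torch.save):

```python
# Sketch: persist each model as its own file under /opt/ml/checkpoints,
# which SageMaker syncs to S3, instead of bundling everything into model.tar.gz.
import os

CHECKPOINT_DIR = "/opt/ml/checkpoints"
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def persist(model, job_index):
    path = os.path.join(CHECKPOINT_DIR, f"model-{job_index}.bin")
    save_model(model, path)  # placeholder for your serialization call
```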
