Let's say I have successfully trained a model on some training data for 10 epochs. How can I then access the very same model and train for a further 10 epochs?
In the docs it suggests "you need to specify a checkpoint output path through hyperparameters" --> how?
# define my estimator the standard way
huggingface_estimator = HuggingFace(
entry_point='train.py',
source_dir='./scripts',
instance_type='ml.p3.2xlarge',
instance_count=1,
role=role,
transformers_version='4.10',
pytorch_version='1.9',
py_version='py38',
hyperparameters = hyperparameters,
metric_definitions=metric_definitions
)
# train the model
huggingface_estimator.fit(
{'train': training_input_path, 'test': test_input_path}
)
If I run huggingface_estimator.fit
again it will just start the whole thing over again and overwrite my previous training.
You can find the relevant checkpoint save/load code in Spot Instances - Amazon SageMaker x Hugging Face Transformers .
(The example enables Spot instances, but you can use on-demand).
'output_dir':'/opt/ml/checkpoints'
.checkpoint_s3_uri
in the Estimator (which is unique to the series of jobs you'll run).from transformers.trainer_utils import get_last_checkpoint # check if checkpoint existing if so continue training if get_last_checkpoint(args.output_dir) is not None: logger.info("***** continue training *****") last_checkpoint = get_last_checkpoint(args.output_dir) trainer.train(resume_from_checkpoint=last_checkpoint) else: trainer.train()
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.