
Train an already trained model in Sagemaker and Huggingface without re-initialising

Let's say I have successfully trained a model on some training data for 10 epochs. How can I then access the very same model and train for a further 10 epochs?

The docs say "you need to specify a checkpoint output path through hyperparameters" — but how do I do that?

# define my estimator the standard way
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.10',
    pytorch_version='1.9',
    py_version='py38',
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions
)

# train the model
huggingface_estimator.fit(
    {'train': training_input_path, 'test': test_input_path}
)

If I run huggingface_estimator.fit again, it just starts the whole training over from scratch and overwrites my previous run.

You can find the relevant checkpoint save/load code in the example "Spot Instances - Amazon SageMaker x Hugging Face Transformers".
(The example enables Spot instances, but it works with on-demand instances as well.)

  1. In hyperparameters you set 'output_dir': '/opt/ml/checkpoints'.
  2. You define a checkpoint_s3_uri in the Estimator (which is unique to the series of jobs you'll run).
  3. You add code to train.py to support checkpointing:

from transformers.trainer_utils import get_last_checkpoint

# check if a checkpoint exists; if so, continue training from it
if get_last_checkpoint(args.output_dir) is not None:
    logger.info("***** continue training *****")
    last_checkpoint = get_last_checkpoint(args.output_dir)
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()
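Putting steps 1 and 2 together, the estimator from the question might be defined like this (a sketch — the S3 URI and the epochs value are placeholders; reuse the same checkpoint_s3_uri when you launch the follow-up job so it resumes from the synced checkpoints):

```python
from sagemaker.huggingface import HuggingFace

hyperparameters = {
    'epochs': 10,  # placeholder; keep whatever hyperparameters you already use
    # step 1: write checkpoints to the local path SageMaker syncs to S3
    'output_dir': '/opt/ml/checkpoints',
}

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.10',
    pytorch_version='1.9',
    py_version='py38',
    hyperparameters=hyperparameters,
    # step 2: S3 location checkpoints are synced to and restored from;
    # a second job started with the same URI downloads them before training,
    # so get_last_checkpoint() in train.py finds them and resumes
    checkpoint_s3_uri='s3://your-bucket/checkpoints/my-training-run',
)
```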

