
Train an already trained model in Sagemaker and Huggingface without re-initialising

Let's say I have successfully trained a model on some training data for 10 epochs. How can I then access the very same model and train for a further 10 epochs?

The docs say "you need to specify a checkpoint output path through hyperparameters" — but how do I do that?

# define my estimator the standard way
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.10',
    pytorch_version='1.9',
    py_version='py38',
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions
)

# train the model
huggingface_estimator.fit(
    {'train': training_input_path, 'test': test_input_path}
)

If I run huggingface_estimator.fit again, it just starts the whole training over from scratch and overwrites my previous run.

You can find the relevant checkpoint save/load code in the example "Spot Instances - Amazon SageMaker x Hugging Face Transformers".
(The example enables Spot instances, but it works with on-demand instances as well.)

  1. In hyperparameters you set 'output_dir': '/opt/ml/checkpoints'.
  2. You define a checkpoint_s3_uri in the Estimator (which is unique to the series of jobs you'll run).
  3. You add code to train.py to support checkpointing:

from transformers.trainer_utils import get_last_checkpoint

# check if a checkpoint exists; if so, continue training from it
if get_last_checkpoint(args.output_dir) is not None:
    logger.info("***** continue training *****")
    last_checkpoint = get_last_checkpoint(args.output_dir)
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()
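Putting steps 1 and 2 together, the estimator from the question might be defined like this (a sketch — the S3 URI and the epochs value are placeholders; reuse the same checkpoint_s3_uri when you launch the follow-up job so it resumes from the synced checkpoints):

```python
from sagemaker.huggingface import HuggingFace

hyperparameters = {
    'epochs': 10,  # placeholder; keep whatever hyperparameters you already use
    # step 1: write checkpoints to the local path SageMaker syncs to S3
    'output_dir': '/opt/ml/checkpoints',
}

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.10',
    pytorch_version='1.9',
    py_version='py38',
    hyperparameters=hyperparameters,
    # step 2: S3 location checkpoints are synced to and restored from;
    # a second job started with the same URI downloads them before training,
    # so get_last_checkpoint() in train.py finds them and resumes
    checkpoint_s3_uri='s3://your-bucket/checkpoints/my-training-run',
)
```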

