Incremental training on custom code in Amazon SageMaker

I'm taking my first steps in Amazon SageMaker. I'm using script mode to train a classification algorithm. Training works, but I can't get incremental training to work: I want to train the same model again with new data. Here is what I did. This is my script:

import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker import get_execution_role

bucket = 'sagemaker-blablabla'
train_data = 's3://{}/{}'.format(bucket,'train')
validation_data = 's3://{}/{}'.format(bucket,'test')

s3_output_location = 's3://{}'.format(bucket)

tf_estimator = TensorFlow(entry_point='main.py', 
                          role=get_execution_role(),
                          train_instance_count=1, 
                          train_instance_type='ml.p2.xlarge',
                          framework_version='1.12', 
                          py_version='py3',
                          output_path=s3_output_location)

inputs = {'train': train_data, 'test': validation_data}
tf_estimator.fit(inputs)
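For context, in script mode the entry point (main.py here) receives hyperparameters as command-line flags, and SageMaker exposes the local paths of the data channels and the model output directory through environment variables such as SM_CHANNEL_TRAIN, SM_CHANNEL_TEST and SM_MODEL_DIR. The TensorFlow estimator in script mode also passes a --model_dir flag. The following is a minimal sketch of the argument parsing an adapted main.py might do; the hyperparameter names --epochs and --batch-size are hypothetical, and unknown flags are ignored so that parsing does not fail.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse script-mode arguments; path defaults come from SageMaker's SM_* env vars."""
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameters, passed as e.g. --epochs 10 --batch-size 32.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=32)
    # TensorFlow script mode passes --model_dir (an S3 URI); accept it so parsing succeeds.
    parser.add_argument('--model_dir', type=str, default=None)
    # Local paths provided by SageMaker through environment variables.
    parser.add_argument('--sm-model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
    parser.add_argument('--test', type=str,
                        default=os.environ.get('SM_CHANNEL_TEST', '/opt/ml/input/data/test'))
    # Keep only the flags this sketch knows about; anything else is returned separately.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

The channel names 'train' and 'test' match the keys of the inputs dictionary passed to fit().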

The entry point is my custom Keras code, which I adapted to receive arguments from the script. Training completes successfully, and the model.tar.gz is now in my S3 bucket. I want to train again, but it's not clear to me how. I tried this:

trained_model = 's3://sagemaker-blablabla/sagemaker-tensorflow-scriptmode-2019-11-27-12-01-42-300/output/model.tar.gz'

tf_estimator = sagemaker.estimator.Estimator(image_name='blablabla-west-1.amazonaws.com/sagemaker-tensorflow-scriptmode:1.12-gpu-py3', 
                                              role=get_execution_role(),
                                              train_instance_count=1, 
                                              train_instance_type='ml.p2.xlarge',
                                              output_path=s3_output_location,
                                              model_uri = trained_model)

inputs = {'train': train_data, 'test': validation_data}

tf_estimator.fit(inputs)

It doesn't work. First, I don't know how to retrieve the training image name (I looked it up in the AWS console, but I guess there should be a smarter way). Second, this code throws an exception about the entry point, but my understanding is that I shouldn't need one when I do incremental training with a ready-made image. I'm surely missing something important; any help? Thank you!

Incremental training is a native feature of the built-in Image Classification and Object Detection algorithms. For custom code, it is the developer's responsibility to write the incremental training logic and to verify that it is valid. Here is a possible path:

  1. Use one of the data channels passed to fit() to load a model state (the artifact to fine-tune).
  2. In your code, check whether the model-state channel contains artifacts. If it does, instantiate a model from that state and continue training. This is framework-specific, and you may need to take precautions to avoid forgetting what was learned previously.
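The two steps above can be sketched on the entry-point side as follows, under several assumptions. The previous model.tar.gz is passed as an extra channel with the hypothetical name 'model', for example inputs = {'train': train_data, 'test': validation_data, 'model': trained_model}, while keeping the original TensorFlow estimator and its entry_point. SageMaker is assumed to expose that channel's local path in SM_CHANNEL_MODEL. The model state is a hypothetical state.json rather than a real Keras checkpoint, and train() is a placeholder. A regular File-mode channel delivers the S3 object as-is, so the sketch unpacks any .tar.gz it finds before looking for the state.

```python
import json
import os
import tarfile

STATE_FILE = 'state.json'  # hypothetical artifact name written by the previous job


def model_channel_dir(channel_name='model'):
    """Local path of the (hypothetically named) model channel, or None if it was not passed."""
    return os.environ.get('SM_CHANNEL_' + channel_name.upper())


def load_previous_state(channel_dir):
    """Step 2: return the saved state if the model channel holds artifacts, else None."""
    if not channel_dir or not os.path.isdir(channel_dir):
        return None
    # Channels deliver S3 objects as-is, so unpack any model.tar.gz first.
    # Only do this for archives you produced yourself (extractall trusts member paths).
    for name in os.listdir(channel_dir):
        if name.endswith('.tar.gz'):
            with tarfile.open(os.path.join(channel_dir, name)) as tar:
                tar.extractall(channel_dir)
    path = os.path.join(channel_dir, STATE_FILE)
    if not os.path.isfile(path):
        return None
    with open(path) as f:
        return json.load(f)


def init_state():
    """Fresh model state for the very first training job (toy placeholder)."""
    return {'epochs_seen': 0}


def train(state, epochs):
    """Placeholder for the framework-specific training loop."""
    state['epochs_seen'] += epochs
    return state


def run(channel_dir, model_dir, epochs=1):
    """Continue from the previous state if one was provided, otherwise start fresh."""
    previous = load_previous_state(channel_dir)
    state = previous if previous is not None else init_state()
    state = train(state, epochs)
    # SageMaker packages the contents of SM_MODEL_DIR into the job's model.tar.gz.
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, STATE_FILE), 'w') as f:
        json.dump(state, f)
    return state
```

Inside the container this would be called as run(model_channel_dir(), os.environ.get('SM_MODEL_DIR', '/opt/ml/model'), epochs=...). In a real Keras script, the JSON load and dump would be replaced by the framework's own save/load calls, such as tf.keras.models.load_model.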

Some frameworks support incremental learning better than others. For example, some sklearn models provide a partial_fit method. With DL frameworks it is technically very easy to continue training from a checkpoint, but if the new data is very different from the data seen before, the model may forget what it previously learned (catastrophic forgetting).
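One common precaution against forgetting, offered here as an illustration rather than as part of the answer above, is rehearsal: fine-tune on the new data mixed with a random sample of the old data. Below is a minimal stdlib sketch of building that mixed training set; the function name replay_mix and its replay_ratio parameter are hypothetical.

```python
import random


def replay_mix(old_data, new_data, replay_ratio=0.5, seed=0):
    """Return all new examples plus a random sample of old ones, shuffled (rehearsal).

    replay_ratio is the number of old examples per new example, capped by len(old_data).
    """
    if replay_ratio < 0:
        raise ValueError('replay_ratio must be non-negative')
    rng = random.Random(seed)  # seeded so the mix is reproducible across runs
    k = min(len(old_data), int(round(len(new_data) * replay_ratio)))
    mixed = list(new_data) + rng.sample(list(old_data), k)
    rng.shuffle(mixed)
    return mixed
```

A higher replay_ratio keeps more of the old distribution in each fine-tuning run, at the cost of a larger training set. The right value depends on how far the new data drifts from the old.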
