
Save model checkpoint only when model shows improvement in TensorFlow

Do you know if there is a way to choose which model is saved when using an Estimator wrapped in an Experiment? Every 'save_checkpoints_steps' steps the model is saved, but that model is not necessarily the best one.

def model_fn(features, labels, mode, params):
    predict = model_predict_()
    loss = model_loss()
    train_op = model_train_op(loss, mode)       
    predictions = {"predictions": predict}

    return tf.estimator.EstimatorSpec(
        mode = mode,
        predictions = predictions,
        loss = loss,
        train_op = train_op,
    )

def experiment_fn(run_config, hparams):
    estimator = tf.estimator.Estimator(
        model_fn = model_fn, 
        config = run_config,
        params = hparams
    )

    return learn.Experiment(
      estimator = estimator,
      train_input_fn = train_input_fn,
      eval_input_fn = eval_input_fn,
      eval_metrics = None,
      train_steps = 1000,
    )

ex = learn_runner.run(
        experiment_fn = experiment_fn,
        run_config = run_config,
        schedule = "train_and_evaluate",
        hparams =  hparams
)
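For completeness, the run_config is created outside the snippet above. A minimal sketch of how it might look, assuming checkpoints are written to the current directory every 50 steps (both values are guesses inferred from the log below, not taken from the actual code):

from tensorflow.contrib.learn import RunConfig

# Assumed configuration (not shown in the question): a checkpoint is written
# every `save_checkpoints_steps` global steps, regardless of model quality,
# which is exactly the behaviour being asked about.
run_config = RunConfig(
    model_dir = ".",              # would explain the .\model.ckpt paths in the log
    save_checkpoints_steps = 50,  # would explain checkpoints at steps 401 and 451
)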

The output is as follows:

INFO:tensorflow:Saving checkpoints for 401 into .\model.ckpt.
INFO:tensorflow:global_step/sec: 0.157117
INFO:tensorflow:step = 401, loss = 2.95048 (636.468 sec)
INFO:tensorflow:Starting evaluation at 2017-09-05-20:06:07
INFO:tensorflow:Restoring parameters from .\model.ckpt-401
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2017-09-05-20:06:09
INFO:tensorflow:Saving dict for global step 401: global_step = 401, loss = 7.20411
INFO:tensorflow:Validation (step 401): global_step = 401, loss = 7.20411
INFO:tensorflow:training loss = 2.95048, step = 401 (315.393 sec)
INFO:tensorflow:Saving checkpoints for 451 into .\model.ckpt.
INFO:tensorflow:Starting evaluation at 2017-09-05-20:11:32
INFO:tensorflow:Restoring parameters from .\model.ckpt-451
INFO:tensorflow:Evaluation [1/1]

You can see that it saves the latest model every time, which is not necessarily the best one.

Checkpoints are saved in case your training process is interrupted. Without checkpoints you would have to restart training from scratch, which is a big issue for large models that take weeks to train.

Once your training is done and you are satisfied with your model (in your words, "it is the best"), you can save it explicitly using https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator#export_savedmodel . Call this method on the Estimator that you used to create your Experiment. Note that this method saves a model for inference, meaning that all the gradient ops will be stripped from it and not saved.
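A minimal sketch of such an export. The serving input below assumes a single float feature named "x" of width 10, which is purely illustrative; adjust it to whatever your model_fn actually reads from features:

import tensorflow as tf

# Hypothetical serving input: a single float feature "x" of shape [None, 10].
# Replace this with the features your model_fn actually expects.
serving_input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
    {"x": tf.placeholder(dtype=tf.float32, shape=[None, 10], name="x")}
)

# Writes the inference graph plus the latest checkpoint's weights into a
# timestamped sub-directory of ./export.
estimator.export_savedmodel("./export", serving_input_fn)

Depending on your TensorFlow version, export_savedmodel also accepts a checkpoint_path argument, so you can point it at a specific checkpoint instead of the latest one.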

EDIT (in reply to Nicolas's comment): You can keep periodic snapshots in addition to the most recent checkpoints by using the keep_checkpoint_every_n_hours option of the RunConfig that you pass when creating the Estimator. If you then find that your model achieved its best performance 10 hours ago, you should be able to find a snapshot from roughly that time.
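A sketch of that option, extending the hypothetical RunConfig from the first snippet (the 2-hour interval is just an example):

from tensorflow.contrib.learn import RunConfig

# Besides the `keep_checkpoint_max` most recent checkpoints (5 by default),
# keep one long-term snapshot every 2 hours (illustrative value).
run_config = RunConfig(
    model_dir = ".",
    save_checkpoints_steps = 50,
    keep_checkpoint_every_n_hours = 2,
)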
