How can I import a whole Ray RLlib PyTorch model into the next round of training and subsequent inference using torch save/load instead of checkpoints?

Question

In Ray RLlib, I usually run PPO training with ray.tune.run like this:

import ray
from ray import tune

ray.init(log_to_driver=False, num_cpus=3,
         local_mode=args.local_mode, num_gpus=1)
env_config = {"code": "codeA"}
config1 = {
    "env_config": env_config,
    "parm": "paramA",
}
stop = {
    "training_iteration": args.stop_iters,
    "timesteps_total": args.stop_timesteps,
    "episode_reward_mean": args.stop_reward,
}
results = tune.run("PPO", config=config1, verbose=0,
                   stop=stop, checkpoint_at_end=True,
                   metric="episode_reward_mean", mode="max",
                   checkpoint_freq=1)

# Take the checkpoints of the best trial; each entry is a (path, metric) pair.
checkpoints = results.get_trial_checkpoints_paths(
    trial=results.get_best_trial(metric="episode_reward_mean", mode="max"),
    metric="episode_reward_mean")
checkpoint_path = checkpoints[0][0]
metric = checkpoints[0][1]

In the next round, I usually retrain the model by restoring from the checkpoint like this:

results = tune.run('PPO', config=config1, verbose=0,
                   stop=stop, checkpoint_at_end=True,
                   metric='episode_reward_mean', mode="max",
                   checkpoint_freq=1, restore=checkpoint_path)

For inference, I restore the checkpoint into a trainer:

from ray.rllib.agents import ppo

agent = ppo.PPOTrainer(config=config1, env=env)
agent.restore(checkpoint_path=checkpoint_path)

This flow works. My questions are: (1) Can I save the whole PyTorch model at the end of ray.tune.run? (2) Can I import that PyTorch model in the next round of ray.tune.run training instead of restoring from a checkpoint? (3) At the inference stage, how can I import the trained whole PyTorch model into the PPO agent? With the restore-based inference flow, I cannot load more than 10 models into memory at a time; loading that many causes an OOM error. If I restore the models one by one instead, checkpoint restoring is too time-consuming and cannot meet my timeliness requirements. Can anyone help me?

Answer 1

You can look at keep_checkpoints_num and checkpoint_score_attr in tune.run() to customize how many checkpoints are kept.
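A minimal sketch of how those two options could be passed, assuming the Ray 1.x tune.run() signature used in the question. The helper names checkpoint_options and run_ppo are hypothetical and only serve the illustration. Note that these options limit how many checkpoints are kept on disk; they do not by themselves reduce the memory used when several agents are restored.

```python
def checkpoint_options(keep_num=3, score_attr="episode_reward_mean"):
    """Build the checkpoint-related keyword arguments for tune.run().

    checkpoint_options is a hypothetical helper, not part of Ray.
    """
    return {
        "checkpoint_freq": 1,                 # checkpoint every training iteration
        "checkpoint_at_end": True,            # always checkpoint when the trial stops
        "keep_checkpoints_num": keep_num,     # keep only this many checkpoints per trial
        "checkpoint_score_attr": score_attr,  # rank checkpoints by this result metric
    }


def run_ppo(config, stop, keep_num=3):
    """Run PPO, keeping only the keep_num best-scoring checkpoints (hypothetical helper)."""
    # Imported lazily so checkpoint_options() can be used without Ray installed.
    from ray import tune

    return tune.run("PPO", config=config, stop=stop, verbose=0,
                    metric="episode_reward_mean", mode="max",
                    **checkpoint_options(keep_num))
```

With the question's variables, this would be called as results = run_ppo(config1, stop, keep_num=1) to keep only the single best checkpoint per trial.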
