
How many epochs are required for training a model with an LSTM?

I have an Actor-Critic TD3 model with an LSTM in my AI. For every training step, I create batches of sequential data and train the model.

Could someone please help me understand whether I also need epochs for this AI? And in general, how many epochs could I run with this code? Since I already create many batches in one training step, is it feasible to have epochs as well?

Below is the training step code:

def train(
        self,
        replay_buffer,
        iterations,
        batch_size=50,
        discount=0.99,
        tau=0.005,
        policy_noise=0.2,
        noise_clip=0.5,
        policy_freq=2,
        ):
        
        b_state = torch.Tensor([])
        b_next_state = torch.Tensor([])
        b_done = torch.Tensor([])
        b_reward = torch.Tensor([])
        b_action = torch.Tensor([])

        for it in range(iterations):

            # print ('it: ', it, ' iterations: ', iterations)

            # Step 4: We sample a batch of transitions (s, s’, a, r) from the memory

            (batch_states, batch_next_states, batch_actions,
             batch_rewards, batch_dones) = \
                replay_buffer.sample(batch_size)

            batch_states = batch_states.astype(float)
            batch_next_states = batch_next_states.astype(float)
            batch_actions = batch_actions.astype(float)
            batch_rewards = batch_rewards.astype(float)
            batch_dones = batch_dones.astype(float)

            state = torch.from_numpy(batch_states)
            next_state = torch.from_numpy(batch_next_states)
            action = torch.from_numpy(batch_actions)
            reward = torch.from_numpy(batch_rewards)
            done = torch.from_numpy(batch_dones)

            b_size = 1
            seq_len = state.shape[0]
            batch = b_size
            input_size = state_dim

            state = torch.reshape(state, ( 1,seq_len, state_dim))
            next_state = torch.reshape(next_state, ( 1,seq_len, state_dim))
            done = torch.reshape(done, ( 1,seq_len, 1))
            reward = torch.reshape(reward, ( 1, seq_len, 1))
            action = torch.reshape(action, ( 1, seq_len, action_dim))
            
            b_state = torch.cat((b_state, state),dim=0)
            b_next_state = torch.cat((b_next_state, next_state),dim=0)
            b_done = torch.cat((b_done, done),dim=0)
            b_reward = torch.cat((b_reward, reward),dim=0)
            b_action = torch.cat((b_action, action),dim=0)
            
            # state = torch.reshape(state, (seq_len, 1, state_dim))
            # next_state = torch.reshape(next_state, (seq_len, 1,
            #         state_dim))
            # done = torch.reshape(done, (seq_len, 1, 1))
            # reward = torch.reshape(reward, (seq_len, 1, 1))
            # action = torch.reshape(action, (seq_len, 1, action_dim))
            
            # b_state = torch.cat((b_state, state),dim=1)
            # b_next_state = torch.cat((b_next_state, next_state),dim=1)
            # b_done = torch.cat((b_done, done),dim=1)
            # b_reward = torch.cat((b_reward, reward),dim=1)
            # b_action = torch.cat((b_action, action),dim=1)
                                                      
        print("dim state:",b_state.shape)

        # for h and c, shape is (num_layers * num_directions, batch, hidden_size)

        ha0 = torch.zeros(lstm_layers, b_state.shape[0], state_dim)
        ca0 = torch.zeros(lstm_layers, b_state.shape[0], state_dim)
        hc0 = torch.zeros(lstm_layers, b_state.shape[0], state_dim + action_dim)
        cc0 = torch.zeros(lstm_layers, b_state.shape[0], state_dim + action_dim)
        # Step 5: From the next state s’, the Actor target plays the next action a’
          
        b_next_action = self.actor_target(b_next_state, (ha0, ca0))
        b_next_action = b_next_action[0]
  
        # Step 6: We add Gaussian noise to this next action a’ and we clamp it in a range of values supported by the environment
          
        noise = torch.Tensor(b_next_action).data.normal_(0,
                policy_noise)
        noise = noise.clamp(-noise_clip, noise_clip)
        b_next_action = (b_next_action + noise).clamp(-self.max_action,
                self.max_action)
  
        # Step 7: The two Critic targets take each the couple (s’, a’) as input and return two Q-values Qt1(s’,a’) and Qt2(s’,a’) as outputs
        
        result = self.critic_target(b_next_state, b_next_action, (hc0,cc0))
        target_Q1 = result[0]
        target_Q2 = result[1]
  
        # Step 8: We keep the minimum of these two Q-values: min(Qt1, Qt2)
          
        target_Q = torch.min(target_Q1, target_Q2).double()
          
        # Step 9: We get the final target of the two Critic models, which is: Qt = r + γ * min(Qt1, Qt2), where γ is the discount factor
          
        target_Q = b_reward + (1 - b_done) * discount * target_Q
          
        # Step 10: The two Critic models take each the couple (s, a) as input and return two Q-values Q1(s,a) and Q2(s,a) as outputs
          
        b_action_reshape = torch.reshape(b_action, b_next_action.shape)
        result = self.critic(b_state, b_action_reshape, (hc0, cc0))
        current_Q1 = result[0]
        current_Q2 = result[1]
          
        # Step 11: We compute the loss coming from the two Critic models: Critic Loss = MSE_Loss(Q1(s,a), Qt) + MSE_Loss(Q2(s,a), Qt)
          
        critic_loss = F.mse_loss(current_Q1, target_Q) \
            + F.mse_loss(current_Q2, target_Q)
          
        # Step 12: We backpropagate this Critic loss and update the parameters of the two Critic models with a SGD optimizer
          
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
          
        # Step 13: Once every two iterations, we update our Actor model by performing gradient ascent on the output of the first Critic model
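        # Note: as written, this actor update (and the target updates below) runs on
        # every call to train(); the policy_freq argument is never used to delay it.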
        
        out = self.actor(b_state, (ha0, ca0))
        out = out[0]
        (actor_loss, hx, cx) = self.critic.Q1(b_state, out, (hc0,cc0))
        actor_loss = -1 * actor_loss.mean()
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()
  
        # Step 14: Still once every two iterations, we update the weights of the Actor target by polyak averaging
          
        for (param, target_param) in zip(self.actor.parameters(),
                self.actor_target.parameters()):
            target_param.data.copy_(tau * param.data + (1 - tau)
                    * target_param.data)
  
        # Step 15: Still once every two iterations, we update the weights of the Critic target by polyak averaging
          
        for (param, target_param) in zip(self.critic.parameters(),
                self.critic_target.parameters()):
            target_param.data.copy_(tau * param.data + (1 - tau)
                    * target_param.data)

First I will say this: reinforcement learning generally requires A LOT of training. This is, however, heavily dependent on the complexity of the problem you are trying to solve. The situation may be made worse if your model is complicated (e.g. when using an LSTM). When training an agent to play Atari games, you can expect to need up to 1 million episodes (depending on the game and the approach used).

With regard to epochs, if you mean the repeated use of a particular episode (or collection of episodes) for training, then it will depend on whether you are using an on-policy or off-policy approach (for off-policy methods this is more like experience replay, and "epochs" is the wrong word). Actor-Critic methods are generally on-policy, which means they require fresh data at each stage of training: once an episode has been used for training, it should not be used again. For more information about the difference between on-policy and off-policy learning, I recommend taking a look at Sutton's book.
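To make the iterations-vs-epochs distinction concrete, here is a minimal sketch of the outer loop that a TD3-style train() like the one above is typically driven by (following the common TD3 reference pattern). The env, policy, replay_buffer, max_timesteps and select_action names are assumptions for illustration, not taken from the question's code. The point is that each call to train() already resamples a fresh batch from the replay buffer on every iteration, so there is no separate epoch loop; how long to keep training is governed by total environment steps, not epochs.

# Sketch only: env is a Gym-style environment, policy is the TD3 agent from the
# question (with an assumed select_action helper), replay_buffer has an add()
# method, and max_timesteps is a placeholder hyperparameter.

total_timesteps = 0
episode_timesteps = 0
obs, done = env.reset(), False

while total_timesteps < max_timesteps:
    if done:
        # Train once per finished episode: replay the buffer for as many
        # iterations as the episode lasted. Each iteration inside train()
        # samples a fresh batch, so no epoch loop is needed here.
        policy.train(replay_buffer, iterations=episode_timesteps)
        obs, done = env.reset(), False
        episode_timesteps = 0

    action = policy.select_action(obs)
    new_obs, reward, done, _ = env.step(action)
    replay_buffer.add((obs, new_obs, action, reward, float(done)))

    obs = new_obs
    episode_timesteps += 1
    total_timesteps += 1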
