
How to properly optimize a shared network between actor and critic?

I'm building an actor-critic reinforcement learning algorithm to solve environments. I want to use a single encoder to learn a representation of the environment.

When I share the encoder between the actor and the critic, the network doesn't learn anything:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Encoder(nn.Module):
  def __init__(self, state_dim):
    super(Encoder, self).__init__()

    self.l1 = nn.Linear(state_dim, 512)

  def forward(self, state):
    a = F.relu(self.l1(state))
    return a

class Actor(nn.Module):
  def __init__(self, state_dim, action_dim, max_action):
    super(Actor, self).__init__()

    self.l1 = nn.Linear(state_dim, 128)
    self.l3 = nn.Linear(128, action_dim)

    self.max_action = max_action

  def forward(self, state):
    a = F.relu(self.l1(state))
    # a = F.relu(self.l2(a))
    a = torch.tanh(self.l3(a)) * self.max_action
    return a

class Critic(nn.Module):
  def __init__(self, state_dim, action_dim):
    super(Critic, self).__init__()

    self.l1 = nn.Linear(state_dim + action_dim, 128)
    self.l3 = nn.Linear(128, 1)

  def forward(self, state, action):
    state_action = torch.cat([state, action], 1)

    q = F.relu(self.l1(state_action))
    # q = F.relu(self.l2(q))
    q = self.l3(q)
    return q

However, when I use a separate encoder for the actor and another for the critic, it learns properly.

class Actor(nn.Module):
  def __init__(self, state_dim, action_dim, max_action):
    super(Actor, self).__init__()

    self.l1 = nn.Linear(state_dim, 400)
    self.l2 = nn.Linear(400, 300)
    self.l3 = nn.Linear(300, action_dim)

    self.max_action = max_action

  def forward(self, state):
    a = F.relu(self.l1(state))
    a = F.relu(self.l2(a))
    a = torch.tanh(self.l3(a)) * self.max_action
    return a

class Critic(nn.Module):
  def __init__(self, state_dim, action_dim):
    super(Critic, self).__init__()

    self.l1 = nn.Linear(state_dim + action_dim, 400)
    self.l2 = nn.Linear(400, 300)
    self.l3 = nn.Linear(300, 1)

  def forward(self, state, action):
    state_action = torch.cat([state, action], 1)

    q = F.relu(self.l1(state_action))
    q = F.relu(self.l2(q))
    q = self.l3(q)
    return q

I'm pretty sure it's because of the optimizers. In the shared-encoder code, I define them as follows:

self.actor_optimizer = optim.Adam(list(self.actor.parameters()) +
                                  list(self.encoder.parameters()))
self.critic_optimizer = optim.Adam(list(self.critic.parameters()) +
                                   list(self.encoder.parameters()))

In the separate-encoder version, it's just:

self.actor_optimizer = optim.Adam(self.actor.parameters())
self.critic_optimizer = optim.Adam(self.critic.parameters())

I need two optimizers because of the actor-critic algorithm.

How can I combine the two optimizers so that the encoder is optimized correctly?

I am not sure how exactly you are sharing the encoder.

However, I would suggest creating a single instance of the encoder and passing it to both the actor and the critic:

encoder_net = Encoder(state_dim)
actor = Actor(encoder_net, state_dim, action_dim, max_action)
critic = Critic(encoder_net, state_dim)

During the forward pass, send the state batch through the encoder first and then through the rest of the network, for example:

class Encoder(nn.Module):
    def __init__(self, state_dim):
        super(Encoder, self).__init__()

        self.l1 = nn.Linear(state_dim, 512)

    def forward(self, state):
        a = F.relu(self.l1(state))
        return a

class Actor(nn.Module):
    def __init__(self, encoder, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.encoder = encoder

        self.l1 = nn.Linear(512, 128)  # 512 matches the encoder's output size
        self.l3 = nn.Linear(128, action_dim)

        self.max_action = max_action

    def forward(self, state):
        state = self.encoder(state)
        a = F.relu(self.l1(state))
        # a = F.relu(self.l2(a))
        a = torch.tanh(self.l3(a)) * self.max_action
        return a

class Critic(nn.Module):
    def __init__(self, encoder, state_dim):
        super(Critic, self).__init__()
        self.encoder = encoder

        self.l1 = nn.Linear(512, 128)  # 512 matches the encoder's output size
        self.l3 = nn.Linear(128, 1)

    def forward(self, state):
        state = self.encoder(state)

        q = F.relu(self.l1(state))
        # q = F.relu(self.l2(q))
        q = self.l3(q)
        return q

Note: The critic network is now a function approximator for the state value function V(s) and not the state-action value function Q(s,a).
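If you would rather keep the action-value formulation Q(s,a) from the question, a minimal sketch (my own variant, not part of the answer above, with the class name QCritic chosen just for illustration) is to encode the state first and concatenate the action afterwards:

class QCritic(nn.Module):
    def __init__(self, encoder, action_dim):
        super(QCritic, self).__init__()
        self.encoder = encoder

        # 512 is the encoder's output width; the action is appended after encoding
        self.l1 = nn.Linear(512 + action_dim, 128)
        self.l3 = nn.Linear(128, 1)

    def forward(self, state, action):
        state = self.encoder(state)
        q = F.relu(self.l1(torch.cat([state, action], 1)))
        q = self.l3(q)
        return q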

Either way, you no longer need to pass the encoder parameters to the optimizers explicitly:

self.actor_optimizer = optim.Adam(self.actor.parameters())
self.critic_optimizer = optim.Adam(self.critic.parameters())

Because the encoder is now a registered submodule of both networks, its parameters are already included in self.actor.parameters() and self.critic.parameters().
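A quick way to convince yourself of this (a minimal check, assuming state_dim, action_dim and max_action are already defined) is to verify that the encoder's weights appear in both parameter lists:

encoder_net = Encoder(state_dim)
actor = Actor(encoder_net, state_dim, action_dim, max_action)
critic = Critic(encoder_net, state_dim)

# The same tensor objects appear in both lists, so stepping either
# optimizer updates the shared encoder.
encoder_params = set(encoder_net.parameters())
assert encoder_params.issubset(set(actor.parameters()))
assert encoder_params.issubset(set(critic.parameters()))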

Hope this helps! :)
