How to properly optimize a shared network between actor and critic?
I am building an actor-critic reinforcement learning algorithm to solve an environment, and I want to use a single encoder to extract features from the environment's states.
When I share the encoder between the actor and the critic, the networks fail to learn anything:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Encoder(nn.Module):
    def __init__(self, state_dim):
        super(Encoder, self).__init__()
        self.l1 = nn.Linear(state_dim, 512)

    def forward(self, state):
        a = F.relu(self.l1(state))
        return a
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.l1 = nn.Linear(state_dim, 128)
        self.l3 = nn.Linear(128, action_dim)
        self.max_action = max_action

    def forward(self, state):
        a = F.relu(self.l1(state))
        # a = F.relu(self.l2(a))
        a = torch.tanh(self.l3(a)) * self.max_action
        return a

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 128)
        self.l3 = nn.Linear(128, 1)

    def forward(self, state, action):
        state_action = torch.cat([state, action], 1)
        q = F.relu(self.l1(state_action))
        # q = F.relu(self.l2(q))
        q = self.l3(q)
        return q
However, when I use a separate encoder for the actor and a separate one for the critic, it learns properly:
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)
        self.max_action = max_action

    def forward(self, state):
        a = F.relu(self.l1(state))
        a = F.relu(self.l2(a))
        a = torch.tanh(self.l3(a)) * self.max_action
        return a

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, 1)

    def forward(self, state, action):
        state_action = torch.cat([state, action], 1)
        q = F.relu(self.l1(state_action))
        q = F.relu(self.l2(q))
        q = self.l3(q)
        return q
I am fairly sure the cause is the optimizers. In the shared-encoder code, I define them as follows:
self.actor_optimizer = optim.Adam(list(self.actor.parameters()) +
                                  list(self.encoder.parameters()))
self.critic_optimizer = optim.Adam(list(self.critic.parameters()) +
                                   list(self.encoder.parameters()))
With separate encoders, it is simply:
self.actor_optimizer = optim.Adam(self.actor.parameters())
self.critic_optimizer = optim.Adam(self.critic.parameters())
The actor-critic algorithm has to step both optimizers.
How can I combine the two optimizers so that the shared encoder is optimized correctly?
I am not sure exactly how you are sharing the encoder.
However, I would suggest creating a single instance of the encoder and passing it to both the actor and the critic:
encoder_net = Encoder(state_dim)
actor = Actor(encoder_net, state_dim, action_dim, max_action)
critic = Critic(encoder_net, state_dim)
During the forward pass, pass the batch of states through the encoder first, before the rest of the network, for example:
class Encoder(nn.Module):
    def __init__(self, state_dim):
        super(Encoder, self).__init__()
        self.l1 = nn.Linear(state_dim, 512)

    def forward(self, state):
        a = F.relu(self.l1(state))
        return a

class Actor(nn.Module):
    def __init__(self, encoder, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.encoder = encoder
        self.l1 = nn.Linear(512, 128)
        self.l3 = nn.Linear(128, action_dim)
        self.max_action = max_action

    def forward(self, state):
        state = self.encoder(state)
        a = F.relu(self.l1(state))
        # a = F.relu(self.l2(a))
        a = torch.tanh(self.l3(a)) * self.max_action
        return a

class Critic(nn.Module):
    def __init__(self, encoder, state_dim):
        super(Critic, self).__init__()
        self.encoder = encoder
        self.l1 = nn.Linear(512, 128)
        self.l3 = nn.Linear(128, 1)

    def forward(self, state):
        state = self.encoder(state)
        q = F.relu(self.l1(state))
        # q = F.relu(self.l2(q))
        q = self.l3(q)
        return q
Note: the critic network is now a function approximator for the state-value function V(s), rather than the state-action value function Q(s, a).
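If you need to keep an action-value critic Q(s, a) instead, a minimal sketch (my own variant, not part of the original answer) is to encode only the state and concatenate the raw action afterwards:

class QCritic(nn.Module):
    # Hypothetical variant that keeps the critic as Q(s, a): the shared
    # encoder processes the state, and the action is concatenated to the
    # 512-dimensional encoding before the first linear layer.
    def __init__(self, encoder, action_dim):
        super(QCritic, self).__init__()
        self.encoder = encoder
        self.l1 = nn.Linear(512 + action_dim, 128)
        self.l3 = nn.Linear(128, 1)

    def forward(self, state, action):
        h = torch.cat([self.encoder(state), action], 1)
        q = F.relu(self.l1(h))
        return self.l3(q)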
With this implementation, you can optimize without explicitly passing the encoder parameters to the optimizers, like so:
self.actor_optimizer = optim.Adam(self.actor.parameters())
self.critic_optimizer = optim.Adam(self.critic.parameters())
because the encoder is now a submodule of both networks, so self.actor.parameters() and self.critic.parameters() each already include the shared encoder's parameters.
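To make the gradient flow concrete, here is a minimal sketch of one update step (assuming actor_loss and critic_loss have already been computed from separate forward passes over a sampled batch; the variable names are illustrative, not from the original post):

# Critic update: gradients flow through the critic's layers
# and the shared encoder.
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()

# Actor update: gradients flow through the shared encoder again,
# this time via the actor's forward pass.
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()

Note that because the encoder is a submodule of both networks, it receives two Adam updates per iteration, one from each optimizer, each with its own moment estimates.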
Hope this helps! :)