
Federated reinforcement learning

I am implementing federated deep Q-learning in PyTorch, using multiple agents, each running DQN. My problem is that when I use a separate replay buffer per agent, with each agent appending its own experiences, two elements of the experiences in each agent's replay buffer, namely "current_state" and "next_state", become the same after the first time slot. That is, within each buffer we see the same values for all current states and the same values for all next states. I have included simplified parts of the code and the results below. Why does appending a new experience change the current states and next states already existing in the buffer? Is there something wrong with defining the buffers as a global variable, or do you have another idea?

<<< time 0 and agent 0:
current_state[0] = [1,2]
next_state[0] = [11,12]
*** experience: (array([ 1., 2.]), 2.0, array([200]), array([ 11., 12.]), 0)
*** buffer: deque([(array([ 1., 2.]), 2.0, array([200]), array([ 11., 12.]), 0)], maxlen=10000)

<<< time 0 and agent 1: 
current_state[1] = [3, 4]
next_state[1] = [13, 14]
*** experience: (array([ 3., 4.]), 4.0, array([400]), array([ 13., 14.]), 0)
*** buffer: deque([(array([ 1., 2.]), 4.0, array([400]), array([ 11., 12.]), 0)], maxlen=10000)

<<< time 1 and agent 0:
current_state = [11,12]
next_state[0] = [110, 120]
*** experience: (array([ 11., 12.]), 6.0, array([600]), array([ 110., 120.]), 0)
*** buffer: deque([(array([ 11., 12.]), 2.0, array([200]), array([ 110., 120.]), 0),(array([ 11., 12.]), 6.0, array([600]), array([ 110., 120.]), 0)], maxlen=10000)

<<< time 1 and agent 1:
current_state = [13, 14]
next_state[1] = [130, 140]
*** experience: (array([ 13., 14.]), 8.0, array([800]), array([ 130., 140.]), 0)
*** buffer: deque([(array([ 13., 14.]), 4.0, array([400]), array([ 130., 140.]), 0),(array([ 13., 14.]), 8.0, array([800]), array([ 130., 140.]), 0)], maxlen=10000)
from collections import deque

import numpy as np


class BasicBuffer:
    def __init__(self, max_size):
        self.max_size = max_size
        self.buffer = deque(maxlen=max_size)

    def add(self, current_state, action, reward, next_state, done):
        """Add a new experience to the buffer."""
        experience = (current_state, action, np.array([reward]), next_state, done)
        self.buffer.append(experience)

def DQNtrain(env, state_size, agent):
    for time in range(time_max):
        for e in range(agents_numbers):
            # ... interact with the environment to obtain action, reward and
            # next_state[e, :] for agent e (details omitted here) ...

            # Add the new experience to this agent's buffer.
            replay_buffer_t[e].add(current_state[e, :], action, reward,
                                   next_state[e, :], done)

            # Advance to the next time slot.
            current_state[e, :] = next_state[e, :]

if __name__ == '__main__':
    # One replay buffer per agent, created (as a global list) before training starts.
    replay_buffer_t = [BasicBuffer(max_size=agent_buffer_size) for _ in range(edge_max)]
    DQNtrain(env, state_size, agent)

I just found what is causing the problem: I should have used copy.deepcopy() (with import copy at the top of the file) on the experiences before appending them:

# Inside BasicBuffer.add: deep-copy so the stored arrays are independent
# of the arrays that keep being updated in the training loop.
experience = copy.deepcopy((current_state, action, np.array([reward]), next_state, done))
self.buffer.append(experience)
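
For context, the aliasing happens because the deque stores references to the NumPy arrays rather than copies: current_state[e, :] is a view into current_state, and the later in-place assignment current_state[e, :] = next_state[e, :] overwrites the very data that the already-stored experiences point to. The following standalone sketch (with toy values, independent of the DQN code above) reproduces the effect and shows that copying at append time avoids it:

    from collections import deque
    import copy
    import numpy as np

    buffer = deque(maxlen=10)

    current_state = np.array([[1.0, 2.0], [3.0, 4.0]])    # one row per agent
    next_state = np.array([[11.0, 12.0], [13.0, 14.0]])

    # Appending a view: the deque only keeps a reference to the row.
    buffer.append((current_state[0, :], next_state[0, :]))

    # In-place assignment overwrites the data the stored view points to...
    current_state[0, :] = next_state[0, :]
    print(buffer[0][0])    # [11. 12.]  -- the stored "current_state" has changed

    # ...whereas deep-copying at append time keeps the stored experience frozen.
    buffer.clear()
    current_state = np.array([[1.0, 2.0], [3.0, 4.0]])
    buffer.append(copy.deepcopy((current_state[0, :], next_state[0, :])))
    current_state[0, :] = next_state[0, :]
    print(buffer[0][0])    # [1. 2.]  -- unchanged

A lighter-weight alternative to deep-copying the whole tuple would be to copy only the arrays when adding, e.g. current_state.copy() and next_state.copy(), since the remaining tuple elements are scalars and therefore immutable.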
