
Input states for Deep Q Learning

I am using a DQN for resource allocation, where the agent should assign arriving requests to the best Virtual Machine. I am modifying a Cartpole code as follows:

import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import os 

class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size          # dimension of the input state vector
        self.action_size = action_size        # number of discrete actions
        self.memory = deque(maxlen=2000)      # replay memory of past transitions
        self.gamma = 0.95                     # discount factor
        self.epsilon = 1.0                    # initial exploration rate
        self.epsilon_decay = 0.995            # multiplicative decay applied after each replay
        self.epsilon_min = 0.01               # floor for the exploration rate
        self.learning_rate = 0.001            # Adam learning rate
        self.model = self._build_model()

    def _build_model(self):
        # Q-network: maps a state vector to one Q-value per action.
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        # Store one transition for experience replay.
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # Epsilon-greedy action selection.
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])

    def replay(self, batch_size):
        # Train on a random minibatch sampled from the replay memory.
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = (reward + self.gamma * np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)

        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)

The Cartpole states, which are the inputs of the Q-network, are given by the environment:

Num   Observation             Min         Max
0     Cart Position           -2.4        2.4
1     Cart Velocity           -Inf        Inf
2     Pole Angle              ~ -41.8°    ~ 41.8°
3     Pole Velocity At Tip    -Inf        Inf
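
For reference, this is roughly how those four values reach the network in the Cartpole example the code above is based on: the environment returns the state and it is reshaped into a (1, state_size) batch before being passed to model.predict. This is only an illustrative sketch, assuming the classic gym API where reset() returns the observation array and the DQNAgent class defined above:

import gym
import numpy as np

env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]   # 4: position, velocity, angle, tip velocity
action_size = env.action_space.n              # 2: push left / push right

agent = DQNAgent(state_size, action_size)

state = env.reset()                           # e.g. array([ 0.03, -0.02,  0.01,  0.04])
state = np.reshape(state, [1, state_size])    # shape (1, 4): a batch containing one state
action = agent.act(state)                     # epsilon-greedy action from the Q-network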

The question is: in my code, what are the inputs of the Q-network? The agent should take the best possible action based on the size of the arriving request, but this is not given by the environment. Should I feed the Q-network with this input value, the size?

The input of the Deep Q-Network architecture is fed by the replay memory, in the following part of the code:

def remember(self, state, action, reward, next_state, done):
    self.memory.append((state, action, reward, next_state, done))

The dynamics of this system, as shown in the original DeepMind paper, are that you interact with the system, store the transition in the replay memory, and then use it for the training step. In the lines above you are storing these experiences.
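
A minimal sketch of that interact / store / train cycle, assuming a gym-style environment named env for your allocation problem and the DQNAgent class from the question:

batch_size = 32
episodes = 100

for episode in range(episodes):
    state = np.reshape(env.reset(), [1, agent.state_size])
    done = False
    while not done:
        action = agent.act(state)                        # epsilon-greedy choice
        next_state, reward, done, _ = env.step(action)   # interact with the environment (classic gym API)
        next_state = np.reshape(next_state, [1, agent.state_size])
        agent.remember(state, action, reward, next_state, done)  # store the transition
        state = next_state
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)                     # training step on a sampled minibatch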

Basically, the input of the network is the state, and the output is the Q-values. In your code there is no interaction with an environment, and that interaction is where you get the transitions (experiences) that feed the replay memory. So, if you cannot extract some information from the environment and represent it as a state (such as the size of the arriving request), the network has nothing to base its decisions on.
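
As a purely hypothetical illustration for the resource-allocation case: you would have to build the state vector yourself from whatever your simulator exposes, for example the size of the arriving request plus the current load of each virtual machine, and feed that vector to the network exactly as the Cartpole state is fed. The names below (build_state, request_size, vm_loads) are assumptions, not part of the question's code:

import numpy as np

# Hypothetical state for the VM-allocation problem:
# [size of the arriving request, load of VM 1, ..., load of VM n]
def build_state(request_size, vm_loads):
    return np.reshape(np.array([request_size] + list(vm_loads), dtype=np.float32),
                      [1, 1 + len(vm_loads)])

# Example: a request of size 0.3 arriving while three VMs are at 20%, 70% and 50% load.
state = build_state(0.3, [0.2, 0.7, 0.5])   # shape (1, 4) -> state_size = 4
# action_size would be the number of VMs the request can be assigned to (here 3),
# and the reward would come from your own allocation objective (e.g. balanced load).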
