
Reinforcement Learning with Keras model

I was trying to implement a Q-learning algorithm in Keras. Based on the articles I read, I came up with these lines of code:

for state, action, reward, next_state, done in sample_batch:
    target = reward
    if not done:
        # Q-learning target: reward + gamma * max_a' Q(next_state, a')
        target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])
    target_f = self.brain.predict(state)   # shape (1, 2): Q values for all actions in `state`
    target_f[0][action] = target           # replace only the taken action's value
    print(target_f.shape)
    self.brain.fit(state, target_f, epochs=1, verbose=0)
if self.exploration_rate > self.exploration_min:
    self.exploration_rate *= self.exploration_decay

The variable sample_batch is an array containing (state, action, reward, next_state, done) tuples sampled from the collected data. I also found the following Q-learning formula:

Q(s,a) <- Q(s,a) + alpha * [reward + gamma * max_a' Q(s',a') - Q(s,a)]

Why is there no - sign in the equation (in the code)? I found out that np.amax returns the maximum of an array, or the maximum along an axis. When I call self.brain.predict(next_state), I get [[-0.06427538 -0.34116858]]. So does it play the role of the prediction in this equation? Going further, target_f is the predicted output for the current state, and then we also write the reward-based target into it at this step. Then we train the model on the current state (X) and target_f (Y). I have a few questions. What is the role of self.brain.predict(next_state), and why is there no minus? Why do we predict twice with one model, i.e. self.brain.predict(state) and self.brain.predict(next_state)[0]?
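For context, here is a minimal sketch of the kind of agent class this snippet usually lives in. The names (Agent, brain, memory, replay) and all hyperparameter values are assumptions for illustration, not the asker's actual code:

import random
from collections import deque
from keras.models import Sequential
from keras.layers import Dense

class Agent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)   # replay buffer of (s, a, r, s', done) tuples
        self.gamma = 0.95                  # discount factor used in the target formula
        self.exploration_rate = 1.0
        self.exploration_min = 0.01
        self.exploration_decay = 0.995
        self.brain = self._build_model()

    def _build_model(self):
        # simple feed-forward net: state in, one Q value per action out
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer='adam')
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def replay(self, batch_size):
        sample_batch = random.sample(self.memory, batch_size)
        # ... the loop from the question goes here ...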

Why is there no - sign in the equation (code)?

It's because the loss calculation is done inside the fit function.

reward + self.gamma * np.amax(self.brain.predict(next_state)[0])

This is the same as the target component in the loss function.

Inside Keras's fit method, the loss is calculated as given below. For a single training data point (standard neural network notation):

x = input state

y = predicted value

y_i = target value

loss(x) = y_i - y   (with mean squared error, the loss is the squared difference (y_i - y)^2)

So the target - prediction step happens internally, inside fit.
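A quick way to see where the subtraction happens: with a mean squared error loss (assumed here, since the model regresses Q values), Keras computes the squared difference between the target you pass to fit (y_i) and the network's own prediction (y). A small illustrative sketch with made-up numbers:

y_target = 1.00              # y_i: reward + gamma * max Q(next_state), computed in the code above
y_pred = 0.25                # y: the network's current prediction Q(state, action)

# the "missing" minus sign lives here, inside the loss
loss = (y_target - y_pred) ** 2
print(loss)                  # 0.5625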

Why do we predict twice on one model?

Good question!!!

 target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])

In this step we predict the value of the next state, in order to calculate the target value for state s if we take a specific action a (denoted as Q(s,a)).
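Concretely, using the prediction shown in the question: predict(next_state) returns one Q value per action, and np.amax takes the largest one; that value is discounted and added to the reward. The gamma and reward values below are assumptions for illustration:

import numpy as np

q_next = np.array([[-0.06427538, -0.34116858]])   # self.brain.predict(next_state)
best_next = np.amax(q_next[0])                    # -0.06427538, the best next-state value
gamma = 0.95                                      # assumed discount factor
reward = 1.0                                      # assumed reward for this transition

target = reward + gamma * best_next               # target for the action that was taken
print(target)                                     # roughly 0.9389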

 target_f = self.brain.predict(state)

In this step we calculate the Q values for every action we can take in state s.

target = 1.00    // target is a single value for action a
target_f = (0.25,0.25,0.25,0.25)   //target_f is a list of values for all actions

The following step is then executed.

target_f[0][action] = target

We only change the value of the selected action (here, if we take action 3).

target_f = (0.25,0.25,1.00,0.25)  // only action 3 value will change

Now target_f is the actual target vector, with the correct shape, that we are training the model to predict.
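Putting both predictions together, here is a small sketch of the whole target-building step using the example numbers from this answer (brain stands for the Keras model; the values are illustrative):

import numpy as np

action = 2                                        # the taken action ("action 3", zero-based index 2)
target = 1.00                                     # reward + gamma * max Q(next_state)

target_f = np.array([[0.25, 0.25, 0.25, 0.25]])   # brain.predict(state): Q values for all actions
target_f[0][action] = target                      # overwrite only the taken action's value

print(target_f)                                   # [[0.25 0.25 1.   0.25]]
# brain.fit(state, target_f, epochs=1, verbose=0)
# the other actions' targets equal the predictions, so only the taken action
# produces a non-zero error and a meaningful gradient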
