
Reinforcement Learning with Keras model

I was trying to implement a Q-learning algorithm in Keras. Based on the articles I read, I came up with these lines of code:

for state, action, reward, next_state, done in sample_batch:
    target = reward
    if not done:
        # Q-learning target: reward + gamma * max_a' Q(next_state, a')
        target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])
    target_f = self.brain.predict(state)   # shape (1, 2): Q values for all actions in `state`
    target_f[0][action] = target           # replace only the taken action's value
    print(target_f.shape)
    self.brain.fit(state, target_f, epochs=1, verbose=0)
if self.exploration_rate > self.exploration_min:
    self.exploration_rate *= self.exploration_decay

The variable sample_batch is an array containing (state, action, reward, next_state, done) tuples sampled from the collected data. I also found the following Q-learning formula:

Q(s,a) <- Q(s,a) + alpha * [reward + gamma * max_a' Q(s',a') - Q(s,a)]

Why is there no - sign in the equation (in the code)? I found out that np.amax returns the maximum of an array, or the maximum along an axis. When I call self.brain.predict(next_state), I get [[-0.06427538 -0.34116858]]. So does it play the role of the prediction in this equation? Going further, target_f is the predicted output for the current state, and then we also write the reward-based target into it at this step. Then we train the model on the current state (X) and target_f (Y). I have a few questions. What is the role of self.brain.predict(next_state), and why is there no minus? Why do we predict twice with one model, i.e. self.brain.predict(state) and self.brain.predict(next_state)[0]?
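For context, here is a minimal sketch of the kind of agent class this snippet usually lives in. The names (Agent, brain, memory, replay) and all hyperparameter values are assumptions for illustration, not the asker's actual code:

import random
from collections import deque
from keras.models import Sequential
from keras.layers import Dense

class Agent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)   # replay buffer of (s, a, r, s', done) tuples
        self.gamma = 0.95                  # discount factor used in the target formula
        self.exploration_rate = 1.0
        self.exploration_min = 0.01
        self.exploration_decay = 0.995
        self.brain = self._build_model()

    def _build_model(self):
        # simple feed-forward net: state in, one Q value per action out
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer='adam')
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def replay(self, batch_size):
        sample_batch = random.sample(self.memory, batch_size)
        # ... the loop from the question goes here ...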

Why is there no - sign in the equation (code)?

It's because the loss calculation is done inside the fit function.

reward + self.gamma * np.amax(self.brain.predict(next_state)[0])

This is the same as the target component in the loss function.

Inside Keras's fit method, the loss is calculated as given below. For a single training data point (standard neural network notation):

x = input state

y = predicted value

y_i = target value

loss(x) = y_i - y   (with mean squared error, the loss is the squared difference (y_i - y)^2)

So the target - prediction step happens internally, inside fit.
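A quick way to see where the subtraction happens: with a mean squared error loss (assumed here, since the model regresses Q values), Keras computes the squared difference between the target you pass to fit (y_i) and the network's own prediction (y). A small illustrative sketch with made-up numbers:

y_target = 1.00              # y_i: reward + gamma * max Q(next_state), computed in the code above
y_pred = 0.25                # y: the network's current prediction Q(state, action)

# the "missing" minus sign lives here, inside the loss
loss = (y_target - y_pred) ** 2
print(loss)                  # 0.5625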

Why do we predict twice on one model?

Good question!!!

 target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])

In this step we predict the value of the next state, in order to calculate the target value for state s if we take a specific action a (denoted as Q(s,a)).
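Concretely, using the prediction shown in the question: predict(next_state) returns one Q value per action, and np.amax takes the largest one; that value is discounted and added to the reward. The gamma and reward values below are assumptions for illustration:

import numpy as np

q_next = np.array([[-0.06427538, -0.34116858]])   # self.brain.predict(next_state)
best_next = np.amax(q_next[0])                    # -0.06427538, the best next-state value
gamma = 0.95                                      # assumed discount factor
reward = 1.0                                      # assumed reward for this transition

target = reward + gamma * best_next               # target for the action that was taken
print(target)                                     # roughly 0.9389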

 target_f = self.brain.predict(state)

In this step we calculate the Q values for every action we can take in state s.

target = 1.00    // target is a single value for action a
target_f = (0.25,0.25,0.25,0.25)   //target_f is a list of values for all actions

The following step is then executed.

target_f[0][action] = target

We only change the value of the selected action (here, if we take action 3).

target_f = (0.25,0.25,1.00,0.25)  // only action 3 value will change

Now target_f is the actual target vector, with the correct shape, that we are training the model to predict.
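Putting both predictions together, here is a small sketch of the whole target-building step using the example numbers from this answer (brain stands for the Keras model; the values are illustrative):

import numpy as np

action = 2                                        # the taken action ("action 3", zero-based index 2)
target = 1.00                                     # reward + gamma * max Q(next_state)

target_f = np.array([[0.25, 0.25, 0.25, 0.25]])   # brain.predict(state): Q values for all actions
target_f[0][action] = target                      # overwrite only the taken action's value

print(target_f)                                   # [[0.25 0.25 1.   0.25]]
# brain.fit(state, target_f, epochs=1, verbose=0)
# the other actions' targets equal the predictions, so only the taken action
# produces a non-zero error and a meaningful gradient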
