Reinforcement Learning with Keras model
I was trying to implement a Q-learning algorithm in Keras. Following some articles, I found these lines of code.
for state, action, reward, next_state, done in sample_batch:
    target = reward
    if not done:
        # formula
        target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])
    target_f = self.brain.predict(state)  # shape (1, 2)
    target_f[0][action] = target
    print(target_f.shape)
    self.brain.fit(state, target_f, epochs=1, verbose=0)
if self.exploration_rate > self.exploration_min:
    self.exploration_rate *= self.exploration_decay
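To make the loop above self-contained, here is a runnable sketch of one pass over a replay batch. FakeBrain is a hypothetical numpy stand-in for self.brain (the real code uses a Keras model; only the predict/fit interface is mimicked, and the learning rate 0.1 and state dimension 4 are made up):

```python
import numpy as np

class FakeBrain:
    """Hypothetical stand-in for the Keras model: linear Q-values, 2 actions."""
    def __init__(self, state_dim=4, n_actions=2):
        self.w = np.zeros((state_dim, n_actions))

    def predict(self, state):
        # state has shape (1, state_dim); output has shape (1, n_actions)
        return state @ self.w

    def fit(self, state, target_f):
        # Crude gradient step toward target_f (Keras fit does this properly)
        error = target_f - self.predict(state)   # target - prediction
        self.w += 0.1 * state.T @ error

gamma = 0.95
brain = FakeBrain()
sample_batch = [
    (np.ones((1, 4)), 0, 1.0, np.zeros((1, 4)), False),  # (s, a, r, s', done)
]

for state, action, reward, next_state, done in sample_batch:
    target = reward
    if not done:
        target = reward + gamma * np.amax(brain.predict(next_state)[0])
    target_f = brain.predict(state)   # Q-values for all actions, shape (1, 2)
    target_f[0][action] = target      # overwrite only the taken action
    brain.fit(state, target_f)

print(brain.predict(np.ones((1, 4))))  # the Q-value of action 0 has moved toward the target
```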
The variable sample_batch is an array containing (state, action, reward, next_state, done) samples from the collected data. I also found the following Q-learning formula:

Q(s, a) ← Q(s, a) + α · [reward + γ · max_a' Q(s', a') − Q(s, a)]
Why is there no minus sign in the equation (code)? I found out that np.amax returns the maximum of an array, or the maximum along an axis. When I call self.brain.predict(next_state), I get [[-0.06427538 -0.34116858]]. So does it play the role of the prediction in this equation? Going forward, target_f is the predicted output for the current state, into which we then write the reward-based target for this step. Then we train the model on the current state (X) and target_f (Y).

I have a few questions. What is the role of self.brain.predict(next_state), and why is there no minus? Why do we predict twice on one model, e.g. self.brain.predict(state) and self.brain.predict(next_state)[0]?
Why is there no - sign in the equation (code)?
It's because the loss calculation is done inside the fit function.
reward + self.gamma * np.amax(self.brain.predict(next_state)[0])
This is the same as the target component in the loss function.
Inside Keras's fit method, the loss is calculated as given below. For a single training data point (standard neural-network notation):

x   = input state
y   = predicted value
y_i = target value

loss(x) = y_i - y

The target − prediction subtraction happens internally at this step.
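That internal subtraction can be sketched numerically. Assuming the model was compiled with mean-squared error (common for this kind of DQN, though the actual loss depends on how self.brain was compiled), the minus lives inside the loss:

```python
import numpy as np

y_pred = np.array([[-0.06427538, -0.34116858]])  # self.brain.predict(state)
y_true = y_pred.copy()
y_true[0][0] = 1.0                               # target written in for action 0

# Inside fit, Keras computes something like MSE over the outputs:
loss = np.mean((y_true - y_pred) ** 2)           # "target - prediction" happens here
print(loss)
```

Note that the untouched action contributes zero to the loss, which is exactly why target_f is built from a prediction on the current state: only the chosen action's error drives the update.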
Why do we predict twice on one model?
Good question!
target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])
In this step we predict the value of the next state in order to calculate the target value for state s when taking a specific action a (denoted Q(s, a)).
target_f = self.brain.predict(state)
In this step we calculate the Q-values for every action we can take in state s.
target = 1.00                        # target is a single value, for action a
target_f = (0.25, 0.25, 0.25, 0.25)  # target_f is a list of values for all actions
Then the following step is executed.

target_f[0][action] = target

We only change the value of the selected action (say we take action 3):

target_f = (0.25, 0.25, 1.00, 0.25)  # only the action-3 value changes

Now target_f is the actual target value we are trying to predict, with the correct shape.
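Putting the two predictions together, here is a hypothetical numeric walk-through (the Q-values below are made up; gamma = 0.95 as in the question):

```python
import numpy as np

gamma = 0.95
reward = 1.0
action = 2  # index of the action that was taken

# First prediction: Q-values of next_state, used only to build the target
q_next = np.array([[0.1, 0.3, 0.2, 0.05]])    # stand-in for brain.predict(next_state)
target = reward + gamma * np.amax(q_next[0])  # 1.0 + 0.95 * 0.3 = 1.285

# Second prediction: Q-values of the current state, used as the training label
target_f = np.array([[0.25, 0.25, 0.25, 0.25]])  # stand-in for brain.predict(state)
target_f[0][action] = target                     # only the taken action changes

print(target_f)  # target_f is now [[0.25, 0.25, 1.285, 0.25]]
```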