Updating an old system to Q-learning with Neural Networks

Recently I've been reading a lot about Q-learning with neural networks and thought about updating an existing old optimization system in a power plant boiler, composed of a simple feed-forward neural network that approximates an output from many sensory inputs. That output is then fed to a linear model-based controller, which somehow outputs an optimal action again, so the whole model can converge to a desired goal.

Identifying linear models is a time-consuming task. I thought about refurbishing the whole thing into model-free Q-learning with a neural network approximation of the Q-function. I drew a diagram to ask whether I'm on the right track or not.

[Diagram: model]

My question: if you think I've understood the concept well, should my training set be composed of state feature vectors on one side and Q_target - Q_current on the other (here I'm assuming there's an increasing reward), in order to force the whole model towards the target, or am I missing something?

Note: The diagram shows a comparison between the old system in the upper part and my proposed change in the lower part.

EDIT: Does a State Neural Network guarantee Experience Replay?

You might just use the Q values of all the actions in the current state as the output layer of your network. A poorly drawn diagram is here.
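To make that concrete, here is a minimal sketch of such a network, assuming PyTorch and a discrete action set; `state_dim`, `n_actions` and the layer sizes are placeholders, not values from your plant:

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a vector of state features to one Q value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # output layer: Q(s, a) for every action a
        )

    def forward(self, state):       # state: (batch, state_dim)
        return self.net(state)      # returns: (batch, n_actions)
```

Picking the greedy action is then just an argmax over that output vector.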

You can therefore take advantage of the NN's ability to output multiple Q values at a time. Then, just backprop using the loss derived from Q(s, a) <- Q(s, a) + alpha * (reward + discount * max(Q(s', a')) - Q(s, a)), where max(Q(s', a')) can easily be computed from the output layer.
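A rough sketch of that update step (again assuming PyTorch; the dimensions, gamma and the single hand-made transition are placeholders, and in practice the transitions would come from logged plant data or a replay buffer):

```python
import torch
import torch.nn as nn

# Placeholder dimensions: 4 sensory features, 3 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99  # discount factor

# One made-up transition (s, a, r, s').
s  = torch.rand(1, 4)      # current state features
a  = torch.tensor([0])     # index of the action taken
r  = torch.tensor([0.5])   # observed reward
s2 = torch.rand(1, 4)      # next state features

# TD target: reward + discount * max(Q(s', a')); no gradient flows through it.
with torch.no_grad():
    target = r + gamma * q_net(s2).max(dim=1).values

# Q(s, a) for the action actually taken, picked out of the multi-action output.
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

# The squared TD error drives the network towards the target, playing the role
# of alpha * (target - Q(s, a)) in the tabular update rule above.
loss = nn.functional.mse_loss(q_sa, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```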

Please let me know if you have further questions.
