

How can I apply reinforcement learning to continuous action spaces?

I'm trying to get an agent to learn the mouse movements necessary to best perform some task in a reinforcement learning setting (i.e. the reward signal is the only feedback for learning).

I'm hoping to use the Q-learning technique, but while I've found a way to extend this method to continuous state spaces, I can't seem to figure out how to accommodate a problem with a continuous action space.

I could just force all mouse movements to have a certain magnitude and to go in only a certain number of different directions, but any reasonable way of making the actions discrete would yield a huge action space. Since standard Q-learning requires the agent to evaluate all possible actions, such an approximation doesn't solve the problem in any practical sense.

Fast forward to this year: folks from DeepMind proposed a deep reinforcement learning actor-critic method for dealing with both continuous state and action spaces. It is based on a technique called the deterministic policy gradient. See the paper Continuous control with deep reinforcement learning and some implementations.
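Below is a minimal sketch, assuming PyTorch, of the two networks at the heart of that kind of method: the actor maps a state directly to a continuous action, and the critic scores a given state-action pair, so no maximum over an enumerated action set is ever needed. The layer sizes, activations, and tanh squashing here are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of DDPG-style actor/critic networks (not the paper's exact architecture).
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: maps a state to one continuous action vector."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Q(s, a): scores a state-action pair, no enumeration over actions needed."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# The actor is trained to maximize Q(s, actor(s)); the critic is trained with a
# TD target, as in ordinary Q-learning but without any max over actions.
actor, critic = Actor(state_dim=8, action_dim=2), Critic(state_dim=8, action_dim=2)
state = torch.randn(32, 8)                        # batch of states (dimensions assumed)
actor_loss = -critic(state, actor(state)).mean()  # deterministic policy gradient objective
```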

There are numerous ways to extend reinforcement learning to continuous actions. One way is to use actor-critic methods. Another way is to use policy gradient methods.
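To illustrate the policy-gradient route, here is a minimal sketch (assuming PyTorch; the architecture and the state-independent log standard deviation are arbitrary choices) of a Gaussian policy that emits continuous actions directly, so nothing has to be discretized:

```python
# Hypothetical sketch of a Gaussian policy for continuous actions (REINFORCE-style).
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned, state-independent

    def forward(self, state):
        mean = self.mean_net(state)
        std = self.log_std.exp()
        return torch.distributions.Normal(mean, std)

policy = GaussianPolicy(state_dim=4, action_dim=2)
state = torch.randn(1, 4)
dist = policy(state)
action = dist.sample()                 # a real-valued 2-D action, e.g. a (dx, dy) mouse move
log_prob = dist.log_prob(action).sum(-1)
# A one-step policy-gradient update would then use:  loss = -(log_prob * reward).mean()
```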

A rather extensive explanation of different methods can be found in the following paper, which is available online: Reinforcement Learning in Continuous State and Action Spaces (by Hado van Hasselt and Marco A. Wiering).

The common way of dealing with this problem is with actor-critic methods. These naturally extend to continuous action spaces. Basic Q-learning can diverge when working with function approximation; however, if you still want to use it, you can try combining it with a self-organizing map, as done in "Applications of the self-organising map to reinforcement learning". The paper also contains some further references you might find useful.
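To make that idea concrete: the self-organizing map learns a small set of prototype vectors in the continuous action space, and ordinary discrete Q-learning is then run over the indices of those prototypes. A minimal numpy sketch of that scheme, with the prototypes fixed rather than learned by an actual SOM (an assumption made here for brevity):

```python
# Hypothetical sketch: discrete Q-learning over a small set of continuous "action prototypes".
# A self-organizing map would learn these prototype vectors from data; here they are fixed.
import numpy as np

rng = np.random.default_rng(0)

n_states = 100                                   # assume states are already discretized/indexed
prototypes = rng.uniform(-1, 1, size=(16, 2))    # 16 prototype actions in R^2
Q = np.zeros((n_states, len(prototypes)))

alpha, gamma, epsilon = 0.1, 0.99, 0.1

def select_action(s):
    """Epsilon-greedy over prototype indices; returns (index, continuous action)."""
    if rng.random() < epsilon:
        i = int(rng.integers(len(prototypes)))
    else:
        i = int(np.argmax(Q[s]))
    return i, prototypes[i]

def update(s, i, r, s_next):
    """Standard Q-learning update; each 'discrete' action is really a continuous vector."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, i] += alpha * (td_target - Q[s, i])
```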

For what you're doing I don't believe you need to work in continuous action spaces. Although the physical mouse moves in a continuous space, internally the cursor only moves in discrete steps (usually at pixel level), so any precision beyond that threshold seems unlikely to have any effect on your agent's performance. The state space is still quite large, but it is finite and discrete.
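Concretely, a discrete mouse-action set could just enumerate a few directions and a few pixel step sizes; the particular choices below (8 directions, 3 magnitudes) are arbitrary:

```python
# Hypothetical enumeration of a discrete mouse-movement action set:
# 8 compass directions x 3 step sizes (in pixels) + a "no move" action = 25 actions.
import math

directions = [(math.cos(k * math.pi / 4), math.sin(k * math.pi / 4)) for k in range(8)]
magnitudes = [1, 5, 25]   # pixel step sizes, chosen arbitrarily

actions = [(0, 0)]        # include "don't move"
for dx, dy in directions:
    for m in magnitudes:
        actions.append((round(dx * m), round(dy * m)))

print(len(actions))       # 25 discrete actions instead of a continuous (dx, dy) space
```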

I know this post is somewhat old, but in 2016 a variant of Q-learning applied to continuous action spaces was proposed as an alternative to actor-critic methods. It is called Normalized Advantage Functions (NAF). Here's the paper: Continuous Deep Q-Learning with Model-based Acceleration.
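The core trick in NAF is to restrict Q to a form whose maximizing action is available in closed form: Q(s, a) = V(s) − ½ (a − μ(s))ᵀ P(s) (a − μ(s)) with P(s) positive-definite, so argmax_a Q(s, a) = μ(s). A rough PyTorch sketch of that quadratic head (the trunk and hidden size are assumptions, not the paper's exact network):

```python
# Hypothetical sketch of the NAF quadratic head: Q(s,a) = V(s) - 0.5*(a-mu)^T P (a-mu),
# with P = L L^T built from a lower-triangular L, so max_a Q(s,a) = V(s) at a = mu(s).
import torch
import torch.nn as nn

class NAFHead(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.action_dim = action_dim
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.V = nn.Linear(hidden, 1)                        # state value
        self.mu = nn.Linear(hidden, action_dim)              # greedy (argmax) action
        self.L = nn.Linear(hidden, action_dim * action_dim)  # entries of lower-triangular L

    def forward(self, state, action):
        h = self.trunk(state)
        V, mu = self.V(h), self.mu(h)
        L = self.L(h).view(-1, self.action_dim, self.action_dim)
        L = torch.tril(L)                                    # keep lower triangle
        diag = torch.diagonal(L, dim1=-2, dim2=-1).exp()     # force positive diagonal
        L = L - torch.diag_embed(torch.diagonal(L, dim1=-2, dim2=-1)) + torch.diag_embed(diag)
        P = L @ L.transpose(-2, -1)                          # positive-definite matrix
        d = (action - mu).unsqueeze(-1)
        advantage = -0.5 * (d.transpose(-2, -1) @ P @ d).squeeze(-1)
        return V + advantage, mu                             # Q(s,a) and the argmax action mu(s)
```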

Another paper to make the list, from the value-based camp, is Input Convex Neural Networks. The idea is to constrain the network so that the optimization over actions is convex: the network is convex in the actions (not necessarily in the states) and is used to model −Q, so that the argmax Q inference reduces to a convex problem with a global optimum. That is much faster than an exhaustive sweep and easier to implement than other value-based approaches, though likely at the expense of reduced representational power compared to the usual feedforward or convolutional neural networks.
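Structurally, "convex in the actions" means the weights applied to action-dependent hidden activations are constrained to be non-negative and the nonlinearity is convex and non-decreasing (e.g. ReLU). A rough sketch of such a network, convex in the action for each fixed state; the two-layer shape is an illustrative assumption, not the architecture from the paper, and in practice it would model −Q (a cost) so the greedy action comes from a convex minimization:

```python
# Hypothetical sketch of a network that is convex in the action input:
# weights on action-dependent hidden units are kept non-negative, and the ReLU
# nonlinearity is convex and non-decreasing, so the output is convex in `action`.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionConvexNet(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.s1 = nn.Linear(state_dim, hidden)      # state path: unconstrained
        self.a1 = nn.Linear(action_dim, hidden)     # affine in the action
        self.z2 = nn.Linear(hidden, 1, bias=False)  # weights forced non-negative below
        self.a2 = nn.Linear(action_dim, 1)          # skip connection, affine in the action
        self.s2 = nn.Linear(state_dim, 1)

    def forward(self, state, action):
        z1 = F.relu(self.a1(action) + self.s1(state))  # convex in `action`
        w_pos = F.softplus(self.z2.weight)              # enforce non-negative weights
        out = F.linear(z1, w_pos) + self.a2(action) + self.s2(state)
        return out  # convex as a function of `action` for each fixed state
```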
