
Reinforcement learning for continuous state and action space

Problem

My goal is to apply Reinforcement Learning to predict the next state of an object under a known force in a 3D environment (the approach would then reduce to supervised, off-line learning).

Details of my approach

The current state is a vector containing the position of the object in the environment (3 dimensions) and the velocity of the object (3 dimensions). Both the starting position and the starting velocity are randomly initialized within the environment.

The action is the vector representing the movement from state t to state t+1.

The reward is just the Euclidean distance between the predicted next state and the real next state (I already have the target position).
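
A minimal sketch of this set-up (the dynamics and all numbers below are placeholders, not my actual environment):

```python
import numpy as np

# State: position (3D) + velocity (3D); action: a displacement vector applied to the position.
rng = np.random.default_rng(0)

state = np.concatenate([rng.uniform(-1.0, 1.0, 3),   # random starting position
                        rng.uniform(-1.0, 1.0, 3)])  # random starting velocity

action = rng.uniform(-0.1, 0.1, 3)                   # movement from state t to state t+1

predicted_next_pos = state[:3] + action              # position proposed by the agent
true_next_pos = state[:3] + 0.05 * state[3:]         # placeholder "real" next position

# Reward as described: Euclidean distance between the predicted and the real next position.
# In practice it is usually negated so that a smaller error means a higher reward.
distance = np.linalg.norm(predicted_next_pos - true_next_pos)
reward = -distance
print(reward)
```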

What have I done so far?

I have looked into many methods for doing this. Deep Deterministic Policy Gradients works for a continuous action space, but in my case I also have a continuous state space. If you are interested in this approach, here's the original paper written at DeepMind: http://proceedings.mlr.press/v32/silver14.pdf

The Actor-Critic approach should work, but it is usually (or always) applied to discrete and low-dimensional state spaces.

Q-Learning and Deep Q-Learning cannot handle a high-dimensional state space, so my configuration would not work even after discretizing the state space.

Inverse Reinforcement Learning (an instance of Imitation Learning, alongside Behavioral Cloning and Direct Policy Learning) approximates a reward function when finding the reward function is more complicated than finding the policy. It is an interesting approach, but I haven't seen any implementation, and in my case the reward function is pretty straightforward.

Is there a methodology to deal with my configuration that I haven't explored?

In your question, I believe there may be a lot of confusion and misconceptions.

  1. Firstly, deep deterministic policy gradient (DDPG) can definitely handle continuous states and actions, and it is so famous precisely because of that. It is also the first ever stable architecture to do so. Note that the paper you linked is actually DPG, not DDPG. Both DDPG and DPG can handle continuous states and actions, but the latter is much more unstable. The paper was actually published by my "senior" at UofA. Here's the link to DDPG: https://arxiv.org/pdf/1509.02971.pdf

  2. Actor-critic RL is not an algorithm; rather, it's a family of RL algorithms in which the actor maps states to actions, while the critic "pre-processes" the feedback signal so the actor can learn from it more efficiently. DDPG is an example of an actor-critic set-up: in DDPG, a DQN is used as the critic to pre-process the feedback signal for the deterministic policy gradient (the actor). A minimal sketch of this structure is shown after this list.

  3. Q-learning and deep Q-learning are also families of RL algorithms. Q-learning certainly cannot handle large state spaces given inadequate computing power; however, deep Q-learning certainly can. An example is the Deep Q-Network (DQN).
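
To make the actor-critic structure concrete, here is a minimal, illustrative sketch in the spirit of DDPG for your 6-D state / 3-D action problem. The network sizes, learning rates and the random batch below are my own assumptions; a real implementation also needs a replay buffer, exploration noise and slowly-updated target networks.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 6, 3   # position + velocity, 3-D displacement

class Actor(nn.Module):
    """Deterministic policy: maps a state to a continuous action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, ACTION_DIM), nn.Tanh(),  # bounded action in [-1, 1]^3
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q-function: maps a (state, action) pair to a scalar value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# One illustrative update on a random batch (stand-in for a replay-buffer sample).
s, a = torch.randn(32, STATE_DIM), torch.randn(32, ACTION_DIM)
r, s_next = torch.randn(32, 1), torch.randn(32, STATE_DIM)
gamma = 0.99

with torch.no_grad():                                    # bootstrapped TD target
    target_q = r + gamma * critic(s_next, actor(s_next))
critic_loss = nn.functional.mse_loss(critic(s, a), target_q)
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

actor_loss = -critic(s, actor(s)).mean()                 # ascend the critic's value estimate
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```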

Back to the original question.

I can almost guarantee that you can solve your problem using DDPG. In fact, DDPG is still one of the only algorithms that can be used to control an agent in a continuous-state, continuous-action space.

The other method that can do so is called trust region policy optimization (TRPO). It was developed by the UC Berkeley team (along with OpenAI?). The fundamental structures of TRPO and DDPG are identical (both are actor-critic); however, the training is different. DDPG uses a target-network approach to guarantee convergence and stability, while TRPO puts a Kullback-Leibler divergence constraint on the update of the networks to ensure each update is not too large (i.e. the policy at update t is not too different from the policy at t - 1). TRPO is extremely difficult to code, so OpenAI published another paper called Proximal Policy Optimization (PPO). This method is similar to TRPO but easier to implement.
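
To illustrate the difference in training, here is a small sketch of PPO's clipped surrogate loss. The inputs below are random placeholders; in a real agent the log-probabilities come from the current and old policy networks and the advantages from an estimator such as GAE.

```python
import torch

eps = 0.2                               # clip range, a common default
logp_new = torch.randn(32)              # log pi_theta(a|s) under the current policy
logp_old = torch.randn(32)              # log pi_theta_old(a|s), kept fixed
advantages = torch.randn(32)

ratio = torch.exp(logp_new - logp_old)  # probability ratio r_t(theta)
unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages

# Taking the element-wise minimum keeps each policy update small, playing a role
# similar to TRPO's KL-divergence constraint but with a much simpler implementation.
ppo_loss = -torch.mean(torch.min(unclipped, clipped))
print(ppo_loss)
```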

Long story short, I'd recommend trying DDPG, because if your task is as simple as you say, DDPG will definitely work.

Seems like this can be a good paper to look at. If the action space is not discretized, it is difficult to specify or select an action from the huge continuous action space, but discretizing the action space can lead to a significant loss of information. The presented algorithm starts from actions initialized by the policy network in the discretized space. It then explores and evaluates actions with the value network using an Upper Confidence Bound (UCB) in the continuous space.
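
For intuition, a generic UCB1-style selection among a handful of candidate actions might look like the sketch below. This is only an illustration of the general idea, not the exact procedure from the paper, and all values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.uniform(-1.0, 1.0, size=(8, 3))  # candidate actions around the policy output
value_estimates = rng.normal(size=8)              # scores from a value network (placeholder)
visit_counts = rng.integers(1, 10, size=8)        # how often each candidate was evaluated
total_visits = visit_counts.sum()
c = 1.0                                           # exploration coefficient

# UCB score: exploitation term (value estimate) + exploration bonus for rarely tried candidates.
ucb_scores = value_estimates + c * np.sqrt(np.log(total_visits) / visit_counts)
best_action = candidates[np.argmax(ucb_scores)]
print(best_action)
```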

There are a couple of other papers that might be worth looking at; however, the paper above is the most recent one. Hopefully this helps.
