
Reinforcement learning of a policy for multiple actors in large state spaces

I have a real-time domain where I need to assign an action to each of N actors; each action involves moving one of O objects to one of L locations. At each time step, I'm given a reward R indicating the overall success of all actors.

I have 10 actors, 50 unique objects, and 1000 locations, so for each actor I have to select from 500,000 possible actions. Additionally, there are 50 environmental factors I may take into account, such as how close each object is to a wall, or how close it is to an actor. This results in 25,000,000 potential actions per actor.

Nearly all reinforcement learning algorithms don't seem to be suitable for this domain.

First, nearly all of them involve evaluating the expected utility of each action in a given state. My state space is huge, so it would take forever for a policy to converge with something as primitive as Q-learning, even with function approximation. And even if it did converge, picking the best out of millions of candidate actions at every time step would take too long.
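To make the objection concrete, here is a minimal sketch of single-actor Q-learning with linear function approximation. The sizes and the placeholder feature function are illustrative assumptions, not part of the problem statement; the point is that the greedy action selection alone has to score every (object, location) pair at every step, and that is for one actor only.

```python
import numpy as np

# Minimal sketch of single-actor Q-learning with linear function approximation.
# The sizes and the (random) feature function are illustrative assumptions.
N_OBJECTS, N_LOCATIONS, N_FEATURES = 50, 1000, 50
ACTIONS = [(o, l) for o in range(N_OBJECTS) for l in range(N_LOCATIONS)]  # 50,000 pairs

def features(state, action):
    """Placeholder state-action features; a real implementation would encode
    the 50 environmental factors here."""
    rng = np.random.default_rng(abs(hash((state, action))) % (2**32))
    return rng.standard_normal(N_FEATURES)

w = np.zeros(N_FEATURES)          # linear Q(s, a) = w . phi(s, a)
alpha, gamma = 0.01, 0.99

def q(state, action):
    return w @ features(state, action)

def greedy_action(state):
    # The expensive part: every action must be scored at every time step.
    return max(ACTIONS, key=lambda a: q(state, a))

def td_update(state, action, reward, next_state):
    global w
    target = reward + gamma * q(next_state, greedy_action(next_state))
    w += alpha * (target - q(state, action)) * features(state, action)
```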

Secondly, most algorithms assume a single reward per actor, whereas the reward I'm given might be polluted by the mistakes of one or more actors.

How should I approach this problem? I've found no code for domains like this, and the few academic papers I've found on multi-actor reinforcement learning algorithms don't provide nearly enough detail to reproduce the proposed algorithm.

Clarifying the problem

N = 10 actors
O = 50 objects
L = 1000 locations
S = 50 features

As I understand it, you have a warehouse with N actors, O objects, L locations, and some walls. The goal is to make sure that each of the O objects ends up in one of the L locations in the least amount of time. The action space consists of decisions about which actor should be moving which object to which location at any point in time. The state space consists of some 50 X-dimensional environmental factors, including features such as the proximity of actors and objects to walls and to each other. So, at first glance, you have X^S · (O·L)^N action values, with most action dimensions discrete.
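To get a feel for that number, here is a quick back-of-the-envelope computation with the figures from the question, assuming (purely for illustration) that each environmental factor takes X = 10 discrete values:

```python
# Rough count of state-action values for the joint formulation.
# X = 10 discrete values per environmental factor is an illustrative assumption.
N, O, L, S, X = 10, 50, 1000, 50, 10

n_states = X ** S                  # one of X values for each of S factors
n_joint_actions = (O * L) ** N     # each of N actors picks an (object, location) pair
print(f"states:        {float(n_states):.2e}")                    # 1.00e+50
print(f"joint actions: {float(n_joint_actions):.2e}")             # ~9.77e+46
print(f"Q-values:      {float(n_states * n_joint_actions):.2e}")  # ~9.77e+96
```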

The problem as stated is not a good candidate for reinforcement learning. However, it is unclear what the environmental factors really are and how many of the restrictions are self-imposed. So, let's look at a related, but different problem.

Solving a different problem

We look at a single actor. Say it knows its own position in the warehouse, the positions of the other 9 actors, the positions of the 50 objects, and the 1000 locations. It wants to achieve maximum reward, which happens when each of the 50 objects is at one of the 1000 locations.

Suppose we have a P-dimensional representation of position in the warehouse. Each position could be occupied by the actor in focus, one of the other actors, an object, or a location. The action is to choose an object and a location. Therefore, we have a 4^P-dimensional state space and a P^2-dimensional action space; in other words, a 4^P · P^2-dimensional value function. By further experimenting with the representation, using different-precision encodings for different parameters, and using options [2], it might be possible to bring the problem into the practical realm.
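As a rough illustration of why the encoding precision matters, the sketch below (with arbitrary values of P; nothing here comes from the question) shows how quickly the 4^P · P^2 table grows with the number of position cells:

```python
# Size of the single-actor formulation for a few (arbitrary) discretisations P.
# Each of the P cells holds one of 4 things: the actor in focus, another actor,
# an object, or a target location; an action picks an object cell and a location cell.
for P in (20, 50, 100):
    n_states = 4 ** P       # 4 possibilities per cell
    n_actions = P ** 2      # (object cell, location cell) pair
    print(f"P={P:3d}  states={float(n_states):.2e}  actions={n_actions:6,d}  "
          f"Q-values={float(n_states * n_actions):.2e}")
```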

For examples of learning in complicated spatial settings, I would recommend reading the Konidaris papers [1] and [2].
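For what it's worth, the Fourier basis of [1] is straightforward to implement for a linear value function; here is a minimal sketch, with the basis order and the 2-D toy state chosen arbitrarily:

```python
import itertools
import numpy as np

def fourier_features(state, order=3):
    """Fourier basis of [1]: phi_c(s) = cos(pi * c . s) for every coefficient
    vector c in {0, ..., order}^d, with the state s scaled into [0, 1]^d."""
    s = np.asarray(state, dtype=float)
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=s.size)))
    return np.cos(np.pi * coeffs @ s)

# Example: a 2-D state (say, a normalised x/y position) with order 3
# gives (3 + 1)^2 = 16 features for a linear approximation V(s) = w . phi(s).
phi = fourier_features([0.25, 0.8])
print(phi.shape)  # (16,)
```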


[1] Konidaris, G., Osentoski, S. & Thomas, P., 2008. Value function approximation in reinforcement learning using the Fourier basis. Computer Science Department Faculty Publication Series, p. 101.

[2] Konidaris, G. & Barto, A., 2009. Skill discovery in continuous reinforcement learning domains using skill chaining. In Y. Bengio et al., eds. Advances in Neural Information Processing Systems, 22, pp. 1015–1023.
