
State dependent action set in reinforcement learning

How do people deal with problems where the legal actions in different states are different? In my case I have about 10 actions total, and the legal actions do not overlap, meaning that in certain types of states the same 3 actions are always legal, and those actions are never legal in other types of states.

I'm also interested in seeing whether the solutions would be different if the legal actions overlapped.

For Q-learning (where my network gives me the values for state/action pairs), I was thinking maybe I could just be careful about which Q value to choose when I'm constructing the target value (i.e., instead of choosing the max over all actions, I choose the max among legal actions only).
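Something like this minimal sketch is what I have in mind, assuming a PyTorch Q-network and a boolean legal-action mask supplied by the environment (all names here are illustrative):

```python
import torch

def td_target(q_net, next_states, legal_mask, rewards, dones, gamma=0.99):
    """Q-learning target that takes the max over legal actions only.

    legal_mask: bool tensor of shape (batch, n_actions), True where the
    action is legal in the corresponding next state.
    """
    with torch.no_grad():
        q_next = q_net(next_states)                               # (batch, n_actions)
        q_next = q_next.masked_fill(~legal_mask, float("-inf"))   # hide illegal actions
        max_q = q_next.max(dim=1).values                          # max among legal actions
    return rewards + gamma * (1.0 - dones) * max_q
```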

For policy-gradient methods I'm less sure what the appropriate setup is. Is it okay to just mask the output layer when computing the loss?

There are two closely related works from the last two years:

[1] Boutilier, Craig, et al. "Planning and Learning with Stochastic Action Sets." arXiv preprint arXiv:1805.02363 (2018).

[2] Chandak, Yash, et al. "Reinforcement Learning When All Actions Are Not Always Available." AAAI 2020.

Currently this problem does not seem to have a single, universal, straightforward answer. Maybe that is because it is not that much of an issue?

Your suggestion of choosing the best Q value among legal actions is actually one of the proposed ways to handle this. For policy-gradient methods you can achieve a similar result by masking out the illegal actions and renormalizing the probabilities of the remaining actions.
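As a minimal sketch of that masking-and-renormalizing step (assuming PyTorch logits and a boolean legal-action mask; the function name is illustrative, not from any specific library):

```python
import torch
import torch.nn.functional as F

def masked_policy(logits, legal_mask):
    """Set illegal-action logits to -inf; the softmax then renormalizes
    the probabilities of the remaining (legal) actions automatically.

    logits, legal_mask: tensors of shape (batch, n_actions).
    """
    masked_logits = logits.masked_fill(~legal_mask, float("-inf"))
    probs = F.softmax(masked_logits, dim=-1)   # illegal actions get probability 0
    return torch.distributions.Categorical(probs=probs)

# In the policy-gradient loss, sample and score actions from this
# distribution, e.g.:
#   dist = masked_policy(logits, legal_mask)
#   action = dist.sample()
#   loss = -(dist.log_prob(action) * advantage).mean()
```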

Another approach would be giving a negative reward for choosing an illegal action, or ignoring the choice and not making any change in the environment, returning the same reward as before. In one of my own experiments (a Q-learning method) I chose the latter, and the agent learned what it had to learn, but it used the illegal actions as a 'no-op' action from time to time. That wasn't really a problem for me, but a negative reward would probably eliminate this behaviour.
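A hypothetical gym-style wrapper sketching both fallbacks; `legal_actions()`, `observe()`, and the -1.0 penalty are illustrative assumptions, not part of any real API:

```python
class IllegalActionWrapper:
    """Either penalize illegal actions, or treat them as a no-op that
    leaves the environment unchanged (same reward as before)."""

    def __init__(self, env, penalize=True, penalty=-1.0):
        self.env = env
        self.penalize = penalize
        self.penalty = penalty
        self.last_reward = 0.0

    def step(self, action):
        if action not in self.env.legal_actions():   # assumed env method
            if self.penalize:
                # Option 1: negative reward, state unchanged.
                return self.env.observe(), self.penalty, False, {}
            # Option 2: ignore the choice; state unchanged, same reward as before.
            return self.env.observe(), self.last_reward, False, {}
        obs, self.last_reward, done, info = self.env.step(action)
        return obs, self.last_reward, done, info
```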

As you can see, these solutions do not change or differ when the actions are 'overlapping'.

Answering what you asked in the comments: I don't believe you can train the agent in the described conditions without it learning the legal/illegal action rules. Avoiding that would require, for example, something like a separate network for each set of legal actions, which doesn't sound like the best idea (especially if there are many possible legal action sets).

But is learning these rules hard?

You have to answer some questions yourself: is the condition that makes an action illegal hard to express/articulate? It is, of course, environment-specific, but I would say that most of the time it is not that hard to express, and agents simply learn it during training. If it is hard, does your environment provide enough information about the state?

I'm not sure if I understand your question correctly, but if you mean that in certain states some actions are impossible, then you simply reflect this in the reward function (a large negative value). You can even decide to end the episode if it is not clear what state the illegal action would result in. The agent should then learn that those actions are not desirable in the specific states.

In exploration mode, the agent might still choose to take illegal actions. However, in exploitation mode it should avoid them.

I recently built a DDQN agent for Connect Four and had to address this. Whenever a column was chosen that was already full of tokens, I set the reward equivalent to losing the game. This was -100 in my case, and it worked well.

In Connect Four, allowing an illegal move (effectively skipping a turn) can in some cases be advantageous for the player. This is why I set the reward equivalent to losing, rather than a smaller negative number.

So if you set the penalty for an illegal move to something milder than losing, you will have to consider what the implications are in your domain of allowing illegal moves to happen during exploration.
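A minimal sketch of that reward rule, assuming a simple list of per-column token counts; the -100 value is the one from this answer, everything else is illustrative:

```python
LOSE_REWARD = -100   # same value the game assigns to a loss
N_ROWS = 6           # standard Connect Four board height

def illegal_move_penalty(column_heights, column):
    """Return (reward, done) if the chosen column is already full,
    otherwise None so normal game scoring applies."""
    if column_heights[column] >= N_ROWS:
        # Score a full-column choice exactly like losing the game, so the
        # agent cannot exploit illegal moves as a free "skip turn".
        return LOSE_REWARD, True
    return None

# Example: column 2 is full, so choosing it ends the episode with -100.
# illegal_move_penalty([0, 2, 6, 5, 1, 0, 4], 2)  ->  (-100, True)
```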

