Reinforcement Learning in arbitrarily large action/state spaces
I'm interested in using Deep Reinforcement Learning to find a (unique) optimal path back home among (too many) possibilities, with a few (required) intermediate stops (for instance, to buy a coffee or refuel).
Furthermore, I want to apply this in cases where the agent doesn't know a "model" of the environment and can't possibly try all combinations of states and actions, i.e. where approximation techniques are needed for the Q-value function (and/or the policy).
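To make the "approximation in the Q-value function" idea concrete, here is a minimal sketch of semi-gradient Q-learning with linear function approximation. The feature dimensions, learning rate, and toy setup are illustrative assumptions, not part of the question itself:

```python
# Sketch: Q-learning with a linear Q-value approximator,
# Q(s, a) = weights[a] . phi(s), so we never store a table over
# all states. Hyperparameters below are arbitrary for illustration.

ALPHA, GAMMA = 0.1, 0.99      # learning rate, discount factor
N_FEATURES, N_ACTIONS = 4, 2  # size of feature vector, number of actions

# One weight vector per action.
weights = [[0.0] * N_FEATURES for _ in range(N_ACTIONS)]

def q_value(phi, action):
    """Approximate Q(s, a) as a dot product of weights and features."""
    return sum(w * f for w, f in zip(weights[action], phi))

def td_update(phi, action, reward, phi_next, done):
    """Semi-gradient TD(0) update toward r + gamma * max_a' Q(s', a')."""
    target = reward if done else reward + GAMMA * max(
        q_value(phi_next, a) for a in range(N_ACTIONS))
    error = target - q_value(phi, action)
    for i in range(N_FEATURES):
        weights[action][i] += ALPHA * error * phi[i]
    return error
```

With a neural network instead of a linear model and a replay buffer added, this same update scheme becomes DQN.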
I've read about methods for handling cases like this, where rewards, if any, are sparse and binary, such as Monte Carlo Tree Search (which, as I understand it, implies some sort of modeling and planning) or Hindsight Experience Replay (HER), which applies ideas from DDPG.
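The core trick in HER is simple to sketch: after a failed episode, relabel its transitions with the goal the agent actually reached, so a sparse binary reward still produces learning signal. The transition tuple format below is a hypothetical simplification (real HER implementations relabel inside a replay buffer with several goal-selection strategies):

```python
# Sketch of HER-style goal relabeling. Transitions are assumed to be
# (state, action, next_state, goal) tuples; this format is illustrative.

def her_relabel(trajectory):
    """Relabel a trajectory with the goal it actually achieved.

    Even if the original goal was never reached, the final state becomes
    a 'virtual' goal, and the sparse binary reward is recomputed against
    it, so at least one transition in the episode gets a positive reward.
    """
    achieved_goal = trajectory[-1][2]  # final next_state of the episode
    relabeled = []
    for state, action, next_state, _original_goal in trajectory:
        reward = 1.0 if next_state == achieved_goal else 0.0
        relabeled.append((state, action, next_state, achieved_goal, reward))
    return relabeled
```

The relabeled transitions are then stored alongside the originals and trained on with any off-policy algorithm (DDPG in the original HER paper).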
But there are so many different kinds of algorithms to consider that I'm a bit confused about what's best to begin with. I know it's a difficult problem, and maybe it's too naive to ask, but is there any clear, direct and well-known way to solve the problem I want to face?
Thanks a lot!
Matias
If the final destination is fixed, as in this case (home), you can go for a dynamic search, since A* will not work in a changing environment. And if you want to use a deep learning algorithm, then go for A3C with experience replay, given the large action/state spaces. It is capable of handling complex problems.
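On the experience-replay part of this suggestion: note that vanilla A3C is on-policy and does not use a replay buffer (variants such as ACER add one). The buffer itself is a simple structure, sketched here with illustrative capacity and batch sizes:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform experience-replay buffer sketch.

    Stores (state, action, reward, next_state, done) tuples in a ring
    buffer; when full, the oldest transitions are evicted automatically.
    """

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive
        # steps, which stabilizes training with function approximation.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Off-policy agents (DQN, DDPG, and HER on top of them) draw mini-batches from such a buffer at every training step.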