How to calculate the value function in reinforcement learning

Could anybody help explain how the following value function was generated? The problem and solution are attached; I just don't know how the solution was derived. Thank you!

[Problem image]

[Solution image]

STILL NEED HELP WITH THIS!!!

Since no one else has taken a stab at it, I'll present my understanding of the problem (disclaimer: I'm not an expert on reinforcement learning, and I'm posting this as an answer because it's too long to be a comment).

Think of it this way: when starting at, for example, node d, a random walker has a 50% chance of jumping to either node e or node a. Each such jump scales the reward (r) by the multiplier y (the gamma in the picture). You keep jumping around until you reach the target node (f in this case), at which point you collect the reward r.
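In symbols (my formalization, not something from the original post), that procedure computes, for each starting node s, the probability-weighted sum over all walks p from s to the target f, with the reward discounted once per jump after the first (matching the 0.9^0 for the one-jump path below):

    V(s) = \sum_{p \,:\, s \to f} P(p) \, \gamma^{|p| - 1} \, r

where |p| is the number of jumps in p and P(p) is the product of the 1/(number of neighbours) choices made along it.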

If I've understood correctly, the two smaller 3x2 squares represent the expected reward when starting from each node. Now it's obvious why every node in the first 3x2 square has a value of 100: because y = 1, the reward never decreases. You can just keep jumping around until you eventually end up in the reward node (which happens with probability 1), and collect the reward of r = 100.

However, in the second 3x2 square, every jump multiplies the reward by 0.9. So, to get the expected reward when starting from node c, you sum together the rewards you get from the different paths, each multiplied by its probability. Going straight from c to f has a 50% chance and takes 1 jump, so it contributes r = 0.5*0.9^0*100 = 50. Then there's the path cbef: 0.5*(1/3)*(1/3)*0.9^2*100 = 4.5. Then there's cbcf: 0.9^2*0.5^2*(1/3)*100 = 6.75. You keep going this way until the reward from the path you're examining is insignificantly small, then sum together the rewards from all the paths. This should give you the value of the corresponding node, that is, 50 + 6.75 + 4.5 + ... = 76.
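To sanity-check that arithmetic, here is a minimal Python sketch (mine, not from the original post); the helper contribution and the step probabilities (1/2 at c, 1/3 at b and e) are my reading of the figure:

    # Each path contributes P(path) * gamma**(jumps - 1) * r,
    # where P(path) is the product of the per-step jump probabilities.
    gamma, r = 0.9, 100

    def contribution(step_probs):
        p = 1.0
        for q in step_probs:
            p *= q                        # probability of this exact path
        return p * gamma ** (len(step_probs) - 1) * r

    print(contribution([0.5]))            # c -> f           : 50.0
    print(contribution([0.5, 1/3, 1/3]))  # c -> b -> e -> f : 4.5
    print(contribution([0.5, 1/3, 0.5]))  # c -> b -> c -> f : 6.75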

I guess the programmatic way of doing this would be to use a modified DFS/BFS to explore all the paths of length N or less and sum together the rewards from those paths (with N chosen so that 0.9^N is small), as in the sketch below.
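Here is a sketch of that idea, assuming the 3x2 grid is a, b, c on the top row and d, e, f on the bottom, each node connected to its horizontal and vertical neighbours (the adjacency list is my reading of the picture, not ground truth):

    # Enumerate every path of at most max_jumps jumps with a depth-first
    # search and sum their discounted, probability-weighted rewards.
    graph = {
        'a': ['b', 'd'],
        'b': ['a', 'c', 'e'],
        'c': ['b', 'f'],
        'd': ['a', 'e'],
        'e': ['b', 'd', 'f'],
    }

    def expected_reward(start, target='f', gamma=0.9, r=100, max_jumps=20):
        total = 0.0

        def dfs(node, prob, jumps):
            nonlocal total
            if node == target:            # collect the discounted reward
                total += prob * gamma ** (jumps - 1) * r
                return
            if jumps == max_jumps:        # truncate: longer paths contribute little
                return
            for nxt in graph[node]:       # each neighbour is equally likely
                dfs(nxt, prob / len(graph[node]), jumps + 1)

        dfs(start, 1.0, 0)
        return total

    print(expected_reward('c'))  # roughly 76, a touch low because of truncation

Note that the number of paths grows exponentially with max_jumps, so on anything bigger than this toy graph you'd want to aggregate probabilities per node per step instead of enumerating paths one by one.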

Again, take this with a grain of salt; I'm not an expert on reinforcement learning.
