
How to choose action in TD(0) learning

I am currently reading Sutton's Reinforcement Learning: An Introduction. After reading Chapter 6.1 I wanted to implement a TD(0) RL algorithm for this setting:

[image: the Random Walk example from the book]

To do this, I tried to implement the pseudo-code presented here: [image: Tabular TD(0) pseudo-code from the book]

Doing this I wondered how to perform the step A <- action given by π for S: how can I choose the action A for my current state S? Since the value function V(S) depends only on the state and not on the action, I do not really know how this can be done.

I found this question (where I got the images from), which deals with the same exercise - but there the action is just picked randomly and not chosen by a policy π.

Edit: Or is this pseudo-code incomplete, so that I have to approximate the action-value function Q(s, a) in another way, too?

You are right, you cannot choose an action (nor derive a policy π) only from a value function V(s) because, as you noticed, it depends only on the state s.

The key concept that you are probably missing here is that TD(0) learning is an algorithm for computing the value function of a given policy. You are therefore assuming that your agent is following a known policy. In the case of the Random Walk problem, the policy consists of choosing actions randomly, as in the sketch below.
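For concreteness, here is a minimal sketch of TD(0) prediction on the 5-state Random Walk, assuming the standard setup (start in the middle state, reward +1 for terminating on the right, 0 otherwise). The state indexing and constants below are illustrative choices of mine, not part of the book's pseudo-code:

```python
import random

# Minimal TD(0) prediction sketch for the 5-state Random Walk (assumed setup).
# The policy is fixed and random (left/right with probability 0.5 each), so no
# action has to be "chosen" from V(s); we only evaluate the given policy.

N_STATES = 5          # non-terminal states, indexed 1..5 (0 and 6 are terminal)
ALPHA = 0.1           # step size
GAMMA = 1.0           # undiscounted episodic task

V = [0.5] * (N_STATES + 2)       # value estimates; terminals stay at 0
V[0] = V[N_STATES + 1] = 0.0

for episode in range(1000):
    s = (N_STATES + 1) // 2      # start in the middle state
    while True:
        # "A <- action given by pi for S": here pi is the random policy
        a = random.choice([-1, +1])
        s_next = s + a
        # reward is +1 only when terminating on the right
        r = 1.0 if s_next == N_STATES + 1 else 0.0
        # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next
        if s == 0 or s == N_STATES + 1:
            break

print([round(v, 2) for v in V[1:-1]])   # approaches [1/6, 2/6, 3/6, 4/6, 5/6]
```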

If you want to be able to learn a policy, you need to estimate the action-value function Q(s,a). There exist several methods for learning Q(s,a) based on temporal-difference learning, such as SARSA and Q-learning.
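As an illustration, here is a sketch of tabular Q-learning on the same Random Walk environment (the environment details are assumed to match the sketch above). It learns Q(s,a), from which a greedy policy can then be read off:

```python
import random

# Sketch of tabular Q-learning with an epsilon-greedy behaviour policy on the
# assumed 5-state Random Walk. Unlike TD(0) prediction, this learns Q(s, a)
# and therefore lets us derive a policy.

N_STATES, ALPHA, GAMMA, EPSILON = 5, 0.1, 1.0, 0.1
ACTIONS = [-1, +1]                       # move left / move right

# Q-table: Q[s][a_index]; terminal states keep value 0
Q = [[0.0, 0.0] for _ in range(N_STATES + 2)]

def epsilon_greedy(s):
    # Mostly greedy with respect to the current Q estimate
    if random.random() < EPSILON:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda i: Q[s][i])

for episode in range(1000):
    s = (N_STATES + 1) // 2
    while s not in (0, N_STATES + 1):
        a = epsilon_greedy(s)
        s_next = s + ACTIONS[a]
        r = 1.0 if s_next == N_STATES + 1 else 0.0
        # Q-learning update: the target uses the max over next-state actions
        best_next = 0.0 if s_next in (0, N_STATES + 1) else max(Q[s_next])
        Q[s][a] += ALPHA * (r + GAMMA * best_next - Q[s][a])
        s = s_next

# Greedy policy derived from Q: should prefer moving right in every state
print(["left" if Q[s][0] > Q[s][1] else "right" for s in range(1, N_STATES + 1)])
```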

In Sutton's RL book, the authors distinguish between two kinds of problems: prediction problems and control problems. The former refers to estimating the value function of a given policy, and the latter to estimating (optimal) policies, often by means of action-value functions. You can find a reference to these concepts at the beginning of Chapter 6:

As usual, we start by focusing on the policy evaluation or prediction problem, that of estimating the value function for a given policy. For the control problem (finding an optimal policy), DP, TD, and Monte Carlo methods all use some variation of generalized policy iteration (GPI). The differences in the methods are primarily differences in their approaches to the prediction problem.


 