
How to choose action in TD(0) learning

I am currently reading Sutton's Reinforcement Learning: An Introduction. After reading Chapter 6.1 I wanted to implement a TD(0) RL algorithm for this setting:

[image: the Random Walk example from the book]

To do this, I tried to implement the pseudo-code presented here: [image: Tabular TD(0) pseudo-code from the book]

Doing this I wondered how to perform the step A <- action given by π for S: how can I choose the action A for my current state S? Since the value function V(S) depends only on the state and not on the action, I do not really know how this can be done.

I found this question (where I got the images from), which deals with the same exercise - but there the action is just picked randomly and not chosen by a policy π.

Edit: Or is this pseudo-code incomplete, so that I have to approximate the action-value function Q(s, a) in another way, too?

You are right, you cannot choose an action (nor derive a policy π) only from a value function V(s) because, as you noticed, it depends only on the state s.

The key concept that you are probably missing here is that TD(0) learning is an algorithm for computing the value function of a given policy. You are therefore assuming that your agent is following a known policy. In the case of the Random Walk problem, the policy consists of choosing actions randomly, as in the sketch below.
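For concreteness, here is a minimal sketch of TD(0) prediction on the 5-state Random Walk, assuming the standard setup (start in the middle state, reward +1 for terminating on the right, 0 otherwise). The state indexing and constants below are illustrative choices of mine, not part of the book's pseudo-code:

```python
import random

# Minimal TD(0) prediction sketch for the 5-state Random Walk (assumed setup).
# The policy is fixed and random (left/right with probability 0.5 each), so no
# action has to be "chosen" from V(s); we only evaluate the given policy.

N_STATES = 5          # non-terminal states, indexed 1..5 (0 and 6 are terminal)
ALPHA = 0.1           # step size
GAMMA = 1.0           # undiscounted episodic task

V = [0.5] * (N_STATES + 2)       # value estimates; terminals stay at 0
V[0] = V[N_STATES + 1] = 0.0

for episode in range(1000):
    s = (N_STATES + 1) // 2      # start in the middle state
    while True:
        # "A <- action given by pi for S": here pi is the random policy
        a = random.choice([-1, +1])
        s_next = s + a
        # reward is +1 only when terminating on the right
        r = 1.0 if s_next == N_STATES + 1 else 0.0
        # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next
        if s == 0 or s == N_STATES + 1:
            break

print([round(v, 2) for v in V[1:-1]])   # approaches [1/6, 2/6, 3/6, 4/6, 5/6]
```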

If you want to be able to learn a policy, you need to estimate the action-value function Q(s,a). There exist several methods for learning Q(s,a) based on temporal-difference learning, such as SARSA and Q-learning.
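As an illustration, here is a sketch of tabular Q-learning on the same Random Walk environment (the environment details are assumed to match the sketch above). It learns Q(s,a), from which a greedy policy can then be read off:

```python
import random

# Sketch of tabular Q-learning with an epsilon-greedy behaviour policy on the
# assumed 5-state Random Walk. Unlike TD(0) prediction, this learns Q(s, a)
# and therefore lets us derive a policy.

N_STATES, ALPHA, GAMMA, EPSILON = 5, 0.1, 1.0, 0.1
ACTIONS = [-1, +1]                       # move left / move right

# Q-table: Q[s][a_index]; terminal states keep value 0
Q = [[0.0, 0.0] for _ in range(N_STATES + 2)]

def epsilon_greedy(s):
    # Mostly greedy with respect to the current Q estimate
    if random.random() < EPSILON:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda i: Q[s][i])

for episode in range(1000):
    s = (N_STATES + 1) // 2
    while s not in (0, N_STATES + 1):
        a = epsilon_greedy(s)
        s_next = s + ACTIONS[a]
        r = 1.0 if s_next == N_STATES + 1 else 0.0
        # Q-learning update: the target uses the max over next-state actions
        best_next = 0.0 if s_next in (0, N_STATES + 1) else max(Q[s_next])
        Q[s][a] += ALPHA * (r + GAMMA * best_next - Q[s][a])
        s = s_next

# Greedy policy derived from Q: should prefer moving right in every state
print(["left" if Q[s][0] > Q[s][1] else "right" for s in range(1, N_STATES + 1)])
```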

In Sutton's RL book, the authors distinguish between two kinds of problems: prediction problems and control problems. The former refers to estimating the value function of a given policy, and the latter to estimating (optimal) policies, often by means of action-value functions. You can find a reference to these concepts at the beginning of Chapter 6:

As usual, we start by focusing on the policy evaluation or prediction problem, that of estimating the value function for a given policy. For the control problem (finding an optimal policy), DP, TD, and Monte Carlo methods all use some variation of generalized policy iteration (GPI). The differences in the methods are primarily differences in their approaches to the prediction problem.


 