
TD learning vs Q learning

In a perfect-information environment, where we can know the resulting state after an action (like playing chess), is there any reason to use Q-learning rather than TD (temporal difference) learning?

As far as I understand, TD learning tries to learn the state value V(state), while Q-learning learns the action value Q(state, action). Does that mean Q-learning learns more slowly (since there are more state-action combinations than states alone)? Is that correct?

Q-Learning is a TD (temporal difference) learning method.

I think you are trying to refer to TD(0) vs Q-learning.

I would say it depends on whether your actions are deterministic or not. Even if you have the transition function, deciding which action to take under TD(0) can be expensive, because you need to compute the expected value of every action at each step. In Q-learning, that computation is already summarized in the Q-value.
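To illustrate the point, here is a minimal Python sketch (the names transition_model, V, Q and GAMMA are illustrative placeholders, not taken from the answer): with only V you need a model and an expectation per action, while with Q the greedy choice is a single lookup-and-argmax.

# Greedy action selection with V (needs a model) vs with Q (model-free).
# All containers below are hypothetical tabular examples.

# transition_model[(state, action)] -> list of (probability, next_state, reward)
transition_model = {
    ("s0", "left"):  [(1.0, "s1", 0.0)],
    ("s0", "right"): [(0.8, "s2", 1.0), (0.2, "s1", 0.0)],
}
V = {"s1": 0.5, "s2": 1.0}                       # state values
Q = {("s0", "left"): 0.5, ("s0", "right"): 1.7}  # state-action values
GAMMA = 0.9

def greedy_with_v(state, actions):
    """TD(0)-style choice: one-step lookahead, expectation per action."""
    def expected_return(a):
        return sum(p * (r + GAMMA * V[s_next])
                   for p, s_next, r in transition_model[(state, a)])
    return max(actions, key=expected_return)

def greedy_with_q(state, actions):
    """Q-learning-style choice: the lookahead is already baked into Q."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_with_v("s0", ["left", "right"]))  # -> right
print(greedy_with_q("s0", ["left", "right"]))  # -> right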

Given a deterministic environment (or, as you say, a "perfect" environment in which you know the state after performing an action), I guess you can simulate the effect of all possible actions in a given state (i.e., compute all possible next states) and choose the action whose next state has the maximum value V(state).
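A small sketch of that idea for the deterministic case (step and V are hypothetical placeholders, not part of the answer): simulate each action once and pick the one whose successor state has the largest V.

# Deterministic one-step lookahead using only a state-value function V.
V = {"s_a": 0.2, "s_b": 0.9, "s_c": 0.4}

def step(state, action):
    """Deterministic transition: returns the single next state."""
    table = {("s0", "up"): "s_a", ("s0", "down"): "s_b", ("s0", "stay"): "s_c"}
    return table[(state, action)]

def greedy_by_lookahead(state, actions):
    # Compute every reachable next state, then take the action whose
    # successor has the largest V(next_state).
    return max(actions, key=lambda a: V[step(state, a)])

print(greedy_by_lookahead("s0", ["up", "down", "stay"]))  # -> down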

However, it should be taken into account that both the value function V(state) and the Q-function Q(state, action) are defined for a given policy. In a sense, the value function can be considered an average of the Q-function, because V(s) "evaluates" the state s over all possible actions. So, to compute a good estimate of V(s), the agent still needs to try all the possible actions in s.
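To make the "average" remark concrete, the standard relation between the two functions (a textbook formula, e.g. from Sutton & Barto, not quoted from this answer) is

V^π(s) = Σ_a π(a|s) · Q^π(s, a)

i.e. V weights each Q(s, a) by how often the policy π actually chooses a in s, so a good estimate of V(s) still implicitly requires experience with every action.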

In conclusion, I think that although V(s) is simpler than Q(s,a), they likely need a similar amount of experience (or time) to reach a stable estimate.

You can find more info about value (V and Q) functions in this section of the Sutton & Barto RL book.

Q-learning is a TD control algorithm, which means it tries to give you an optimal policy, as you said. TD learning is more general, in the sense that it includes both control algorithms and pure prediction methods that estimate V for a fixed policy.
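For concreteness, the two standard tabular update rules (the usual textbook formulations, added here for comparison) are:

TD(0) prediction (fixed policy): V(s) ← V(s) + α [ r + γ V(s') − V(s) ]
Q-learning control: Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

The first only evaluates the policy that generated the data; the second moves toward the greedy target and therefore performs control.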

Actually, Q-learning is TD learning applied to state-action pairs instead of just states, so it isn't something separate from TD. In TD(0), the agent takes one step (over a state-action pair or just a state) and then updates its value estimate; in n-step TD, the agent takes n steps before updating. So comparing TD with Q-learning isn't quite the right framing: you could compare Q-learning with SARSA instead, or TD with Monte Carlo methods.
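For reference, SARSA's update (a standard formula, added here for the comparison the answer suggests) differs from the Q-learning update above only in its target: it bootstraps from the action a' the policy actually takes next, rather than from the maximum:

Q(s,a) ← Q(s,a) + α [ r + γ Q(s', a') − Q(s, a) ]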
