
TD learning vs Q learning

In a perfect-information environment, where we can know the resulting state after an action (like playing chess), is there any reason to use Q-learning rather than TD (temporal difference) learning?

As far as I understand, TD learning tries to learn the state value V(state), while Q-learning learns the action value Q(state, action). Does that mean Q-learning learns more slowly (since there are more state-action combinations than states alone)? Is that correct?

Q-Learning is a TD (temporal difference) learning method.

I think you are trying to refer to TD(0) vs Q-learning.

I would say it depends on whether your actions are deterministic or not. Even if you have the transition function, deciding which action to take under TD(0) can be expensive, because you need to compute the expected value of every action at each step. In Q-learning, that computation is already summarized in the Q-value.
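To illustrate the point, here is a minimal Python sketch (the names transition_model, V, Q and GAMMA are illustrative placeholders, not taken from the answer): with only V you need a model and an expectation per action, while with Q the greedy choice is a single lookup-and-argmax.

# Greedy action selection with V (needs a model) vs with Q (model-free).
# All containers below are hypothetical tabular examples.

# transition_model[(state, action)] -> list of (probability, next_state, reward)
transition_model = {
    ("s0", "left"):  [(1.0, "s1", 0.0)],
    ("s0", "right"): [(0.8, "s2", 1.0), (0.2, "s1", 0.0)],
}
V = {"s1": 0.5, "s2": 1.0}                       # state values
Q = {("s0", "left"): 0.5, ("s0", "right"): 1.7}  # state-action values
GAMMA = 0.9

def greedy_with_v(state, actions):
    """TD(0)-style choice: one-step lookahead, expectation per action."""
    def expected_return(a):
        return sum(p * (r + GAMMA * V[s_next])
                   for p, s_next, r in transition_model[(state, a)])
    return max(actions, key=expected_return)

def greedy_with_q(state, actions):
    """Q-learning-style choice: the lookahead is already baked into Q."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_with_v("s0", ["left", "right"]))  # -> right
print(greedy_with_q("s0", ["left", "right"]))  # -> right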

Given a deterministic environment (or, as you say, a "perfect" environment in which you know the state after performing an action), I guess you can simulate the effect of all possible actions in a given state (i.e., compute all possible next states) and choose the action whose next state has the maximum value V(state).
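A small sketch of that idea for the deterministic case (step and V are hypothetical placeholders, not part of the answer): simulate each action once and pick the one whose successor state has the largest V.

# Deterministic one-step lookahead using only a state-value function V.
V = {"s_a": 0.2, "s_b": 0.9, "s_c": 0.4}

def step(state, action):
    """Deterministic transition: returns the single next state."""
    table = {("s0", "up"): "s_a", ("s0", "down"): "s_b", ("s0", "stay"): "s_c"}
    return table[(state, action)]

def greedy_by_lookahead(state, actions):
    # Compute every reachable next state, then take the action whose
    # successor has the largest V(next_state).
    return max(actions, key=lambda a: V[step(state, a)])

print(greedy_by_lookahead("s0", ["up", "down", "stay"]))  # -> down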

However, it should be taken into account that both the value function V(state) and the Q-function Q(state, action) are defined for a given policy. In a sense, the value function can be considered an average of the Q-function, because V(s) "evaluates" the state s over all possible actions. So, to compute a good estimate of V(s), the agent still needs to try all the possible actions in s.
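To make the "average" remark concrete, the standard relation between the two functions (a textbook formula, e.g. from Sutton & Barto, not quoted from this answer) is

V^π(s) = Σ_a π(a|s) · Q^π(s, a)

i.e. V weights each Q(s, a) by how often the policy π actually chooses a in s, so a good estimate of V(s) still implicitly requires experience with every action.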

In conclusion, I think that although V(s) is simpler than Q(s,a), they likely need a similar amount of experience (or time) to reach a stable estimate.

You can find more info about value (V and Q) functions in this section of the Sutton & Barto RL book.

Q-learning is a TD control algorithm, which means it tries to give you an optimal policy, as you said. TD learning is more general, in the sense that it includes both control algorithms and pure prediction methods that estimate V for a fixed policy.
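For concreteness, the two standard tabular update rules (the usual textbook formulations, added here for comparison) are:

TD(0) prediction (fixed policy): V(s) ← V(s) + α [ r + γ V(s') − V(s) ]
Q-learning control: Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

The first only evaluates the policy that generated the data; the second moves toward the greedy target and therefore performs control.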

Actually, Q-learning is TD learning applied to state-action pairs instead of just states, so it isn't something separate from TD. In TD(0), the agent takes one step (over a state-action pair or just a state) and then updates its value estimate; in n-step TD, the agent takes n steps before updating. So comparing TD with Q-learning isn't quite the right framing: you could compare Q-learning with SARSA instead, or TD with Monte Carlo methods.
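For reference, SARSA's update (a standard formula, added here for the comparison the answer suggests) differs from the Q-learning update above only in its target: it bootstraps from the action a' the policy actually takes next, rather than from the maximum:

Q(s,a) ← Q(s,a) + α [ r + γ Q(s', a') − Q(s, a) ]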
