
What's the point of using Temporal difference learning at all?

As far as I know, for a specific policy $\pi$, temporal difference learning lets us compute the expected value of following that policy $\pi$. But what is the use of knowing the value function of one specific policy?

Shouldn't we be trying to find the optimal policy for a given environment? What is the point of evaluating one specific $\pi$ with temporal difference learning at all?
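To make the question concrete, here is a minimal TD(0) policy-evaluation sketch in Python. The toy chain environment, the 80%-right policy, and all parameter values are made up purely for illustration:

```python
import random

# A tiny, made-up chain MDP used only for illustration:
# states 0..4, moving right or left, reward 1 on reaching state 4.
N_STATES = 5
TERMINAL = 4

def step(state, action):
    """Hypothetical environment: action +1 moves right, -1 moves left."""
    next_state = min(max(state + action, 0), TERMINAL)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward, next_state == TERMINAL

def policy(state):
    """The fixed policy pi being evaluated: move right 80% of the time."""
    return 1 if random.random() < 0.8 else -1

def td0_evaluate(episodes=1000, alpha=0.1, gamma=0.9):
    """TD(0): after each step, nudge V(s) toward r + gamma * V(s')."""
    V = [0.0] * N_STATES
    for _ in range(episodes):
        s = 0
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = step(s, a)
            # TD(0) update: estimate the value of the fixed policy pi
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V

print(td0_evaluate())
```

This gives me $V^\pi$ for that one policy, which is where my question starts: what do I gain from it?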

As you said, finding the value function of a given policy is, by itself, not very useful in the general case, where the goal is to find an optimal policy. However, several classical algorithms, such as SARSA and Q-learning, can be viewed as special cases of generalized policy iteration, in which the most difficult part is finding the value function of a policy. Once you know the value function, it is easy to derive a better policy; you then find the value function of that newly computed policy, and so on. Under certain conditions, this process converges to the optimal policy.
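To illustrate, here is a minimal sketch of SARSA on the same kind of toy chain environment as in the question; the environment, action set, and hyperparameters are invented for the example. The $\epsilon$-greedy action selection plays the role of policy improvement, while the TD update plays the role of policy evaluation, which is exactly the generalized policy iteration pattern:

```python
import random

# Same toy chain MDP as in the question, purely for illustration.
N_STATES = 5
TERMINAL = 4
ACTIONS = [-1, 1]

def step(state, action):
    next_state = min(max(state + action, 0), TERMINAL)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward, next_state == TERMINAL

def epsilon_greedy(Q, s, eps):
    """Policy improvement: usually pick the action with the best Q(s, a)."""
    if random.random() < eps:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: Q[s][a])

def sarsa(episodes=2000, alpha=0.1, gamma=0.9, eps=0.1):
    Q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, eps)
        done = False
        while not done:
            s_next, r, done = step(s, ACTIONS[a])
            a_next = epsilon_greedy(Q, s_next, eps)
            # Policy evaluation: TD update toward r + gamma * Q(s', a')
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q

print(sarsa())
```

Each TD update evaluates the current $\epsilon$-greedy policy, and acting greedily with respect to the updated Q immediately improves it, so evaluation and improvement are interleaved at every step rather than run to completion in turn.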

In summary, temporal difference learning is a key step in other algorithms that allow us to find an optimal policy.
