
SARSA in Reinforcement Learning

I am coming across the SARSA algorithm in model-free reinforcement learning. Specifically, in each state you take an action a and then observe a new state s' .

My question is: if you don't have the state-transition probability P(next state | current state = s0), how do you know what your next state will be?

My attempt: do you simply try that action a out, and then observe the outcome from the environment?

Usually, yes: you execute the action in the environment, and the environment tells you what the next state is.
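As a minimal sketch of that interaction, assuming a gymnasium-style environment (FrozenLake-v1 and the specific API calls are my assumptions, not part of the original answer): the agent never queries P(s' | s, a); it just executes an action and observes whatever state and reward come back.

```python
# Minimal sketch: act in the environment and observe the next state.
# Assumes a gymnasium-style interface; FrozenLake-v1 is only an example task.
import gymnasium as gym

env = gym.make("FrozenLake-v1")
state, _ = env.reset()

action = env.action_space.sample()                                # pick some action a
next_state, reward, terminated, truncated, _ = env.step(action)   # environment reveals s' and r
print(state, action, reward, next_state)
```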

Yes. Based on the agent's experience, stored in an action-value function, its behavior policy pi maps the current state s to an action a , which leads it to a next state s' and then to a next action a' .

Flowchart of the state-action pair sequence.
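As a hedged sketch of that mapping, the experience "stored in an action-value function" can be pictured as a Q-table, with an epsilon-greedy rule choosing the action a for the current state s; the table shape, the epsilon value, and the helper name below are illustrative assumptions, not part of the original answer.

```python
# Sketch of a behavior policy pi derived from an action-value table Q.
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one from Q[s]."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))      # explore
    return int(np.argmax(Q[s]))                  # exploit current estimates

# Usage (hypothetical sizes): Q = np.zeros((n_states, n_actions)); a = epsilon_greedy(Q, s, n_actions)
```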

A technique called TD learning (temporal-difference learning) is used in both Q-learning and SARSA precisely so that the transition probabilities never have to be learned explicitly.

In short, when you are sampling, i.e. interacting with the system and collecting data samples (state, action, reward, next state, next action), the transition probabilities in SARSA are taken into account implicitly whenever you use a sample to update the parameters of your model. For example, every time you choose an action in the current state and then receive a reward and a new state, the system has in fact generated that reward and new state according to the transition probability p(s', r | s, a).
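To make that concrete, here is a small SARSA sketch; the environment (a gymnasium-style FrozenLake-v1), the hyperparameters (alpha, gamma, epsilon, episode count), and the loop structure are assumptions for illustration. Note how the update uses only the sampled tuple (s, a, r, s', a'), while p(s', r | s, a) stays hidden inside the environment's step.

```python
# SARSA sketch: learn Q(s, a) from sampled transitions, never from p(s', r | s, a) directly.
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1           # illustrative hyperparameters
rng = np.random.default_rng(0)

def policy(s):
    """Epsilon-greedy behavior policy derived from the current Q-table."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

for episode in range(5000):
    s, _ = env.reset()
    a = policy(s)
    done = False
    while not done:
        # The environment draws (r, s') from p(s', r | s, a) for us.
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a_next = policy(s_next)
        # TD update: move Q(s, a) toward the sampled target r + gamma * Q(s', a').
        target = r if terminated else r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
```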

You can find a simple description in this book:

Artificial Intelligence: A Modern Approach
