
Is Monte Carlo learning policy or value iteration (or something else)?

I am taking a Reinforcement Learning class, and I don't understand how to combine the concepts of policy iteration/value iteration with Monte Carlo (and also TD/SARSA/Q-learning). In the table below, how should the empty cells be filled: can it be a binary yes/no, some string description, or is it more complicated?

[Image: a table of RL methods with empty cells to fill in]

Value iteration and policy iteration are model-based methods of finding an optimal policy: they operate on a model of the environment, i.e., a fully specified Markov decision process (MDP) with known transition probabilities and rewards. The main premise behind reinforcement learning is that you don't need the MDP of an environment to find an optimal policy, and traditionally value iteration and policy iteration are not considered RL (although understanding them is key to RL concepts). They learn "indirectly" in the sense that everything goes through a model of the environment, from which the optimal policy is then extracted.
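To make the "model-based" point concrete, here is a minimal sketch of both algorithms on a tiny hand-written MDP. All of the numbers (the transition tensor `P`, reward table `R`, and `gamma`) are made up purely for illustration:

```python
import numpy as np

# Toy, fully known 2-state, 2-action MDP (all numbers made up for illustration).
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
n_states, gamma = 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

# --- Value iteration: repeatedly apply the Bellman optimality backup ---
V = np.zeros(n_states)
while True:
    Q = R + gamma * P @ V        # Q[s, a] = R(s,a) + gamma * sum_s' P(s,a,s') V(s')
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new
print("value iteration:", Q.argmax(axis=1), V_new)

# --- Policy iteration: alternate exact evaluation with greedy improvement ---
pi = np.zeros(n_states, dtype=int)
while True:
    P_pi = P[np.arange(n_states), pi]            # transitions under pi
    R_pi = R[np.arange(n_states), pi]            # rewards under pi
    # evaluation: solve (I - gamma * P_pi) V = R_pi exactly
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    pi_new = (R + gamma * P @ V).argmax(axis=1)  # improvement: act greedily
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
print("policy iteration:", pi)
```

Notice that neither loop ever samples from the environment; everything comes from the model (`P`, `R`), which is exactly what makes these methods model-based.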

"Direct" learning methods do not attempt to construct a model of the environment. They might search for an optimal policy in the policy space or utilize value function-based (aka "value based") learning methods. Most approaches you'll learn about these days tend to be value function-based.

Within value function-based methods, there are two primary types of RL method (a generic sketch of both update flavors follows the list):

  • Policy iteration-based methods
  • Value iteration-based methods
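To see the structural difference without naming any of the homework methods, compare the two generic tabular update templates below. All names here (`alpha`, `gamma`, the `Q` table, the action set) are assumed placeholders, and this deliberately does not map any method to either template:

```python
from collections import defaultdict

# Generic tabular update templates (alpha = step size, gamma = discount).
alpha, gamma = 0.1, 0.9
Q = defaultdict(float)                           # Q[(state, action)]
actions = [0, 1]

def evaluation_flavor_update(s, a, r, s2, a2):
    # policy iteration flavor: bootstrap on the action the CURRENT policy
    # actually takes next, i.e., evaluate that policy, then improve it
    # separately (e.g. by acting epsilon-greedily w.r.t. Q).
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def max_flavor_update(s, a, r, s2):
    # value iteration flavor: fold the greedy improvement into the backup
    # itself by maximizing over next actions, targeting optimal values directly.
    best_next = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```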

Your homework is asking you, for each of those RL methods, whether it is based on policy iteration or value iteration.

A hint: one of those five RL methods is not like the others.
