简体   繁体   中英

Why the bandit problem is also called a one-step/state MDP in Reinforcement learning?

1 步/状态 MDP(马尔可夫决策过程)是什么意思?

Let us consider a n action 1 state MDP. Regardless of which action you take, you are going to stay in the same state. You will, though, get a reward that depends only on the action you took. If you wish to maximise the long term reward in this setting, what you need to do is just judge which of n available choices (actions) is the best.

This is exactly what the bandit problem is.

In bandit the past pulls of levers do not affect what the lever will output or the reward.

The reward is only dependent on which lever is pulled and nothing in the past.

So there is only one state.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM