
Reinforcement Learning - TD learning from afterstates

I'm making a program that teaches 2 players to play a simple board game using Reinforcement Learning and the Temporal Difference learning method (TD(λ)) based on afterstates. Learning occurs by training a neural network; I use Sutton's NonLinear TD/Backprop neural network. I would really like your opinion on the following dilemma. The basic algorithm/pseudocode for playing a turn between the two opponents is this:

WHITE.CHOOSE_ACTION(GAME_STATE);                 // White player decides on its next move by evaluating the current game state (TD(λ) learning)
GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION);  // We apply the chosen action of the player to the environment and a new game state emerges
IF (GAME_STATE != FINAL) {                       // If the new state is not final (not a winning state for White), do the same for the Black player
    BLACK.CHOOSE_ACTION(GAME_STATE);
    GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION);  // We apply the chosen action of the Black player to the environment and a new game state emerges
}
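
For concreteness, here is a minimal sketch in Python of how CHOOSE_ACTION could evaluate afterstates. value_net, legal_moves, and apply_move are hypothetical helpers standing in for the neural network and the game rules; they are not names from the program above.

    import random

    def choose_action(value_net, game_state, legal_moves, apply_move, epsilon=0.1):
        """Pick the move whose resulting afterstate the value network rates highest.

        value_net(state) -> float, legal_moves(state) -> list of moves, and
        apply_move(state, move) -> new state are all hypothetical helpers.
        """
        moves = legal_moves(game_state)
        if random.random() < epsilon:
            # Occasional random move so the agent keeps exploring.
            return random.choice(moves)
        # Greedy step: evaluate the afterstate produced by every legal move
        # and keep the move with the highest estimated value.
        return max(moves, key=lambda move: value_net(apply_move(game_state, move)))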

When should each player invoke his learning method PLAYER.LEARN(GAME_STATE)? Here is the dilemma.

OPTION A. Immediately after each player's move, once the new afterstate emerges, as follows:

WHITE.CHOOSE_ACTION(GAME_STATE);
GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION);
WHITE.LEARN(GAME_STATE);                         // White learns from the afterstate that emerged right after his action
IF (GAME_STATE != FINAL) {
    BLACK.CHOOSE_ACTION(GAME_STATE);
    GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION);
    BLACK.LEARN(GAME_STATE);                     // Black learns from the afterstate that emerged right after his action
}

OPTION B. Immediately after each player's move, once the new afterstate emerges, but also after the opponent's move, if the opponent makes a winning move:

WHITE.CHOOSE_ACTION(GAME_STATE);
GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION);
WHITE.LEARN(GAME_STATE);
IF (GAME_STATE == FINAL) {                       // If White won
    BLACK.LEARN(GAME_STATE);                     // Make the Black player learn from the White player's winning afterstate
}
IF (GAME_STATE != FINAL) {                       // If White's move did not produce a winning/final afterstate
    BLACK.CHOOSE_ACTION(GAME_STATE);
    GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION);
    BLACK.LEARN(GAME_STATE);
    IF (GAME_STATE == FINAL) {                   // If Black won
        WHITE.LEARN(GAME_STATE);                 // Make the White player learn from the Black player's winning afterstate
    }
}

I believe that option B is more reasonable.
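
To make the comparison concrete, here is how Option B's turn loop might look in Python. world, white, and black are hypothetical objects wrapping the WORLD, WHITE, and BLACK from the pseudocode above, and is_final() is an assumed helper for the FINAL test.

    def play_turn(world, white, black):
        """One full turn under Option B. Returns True when the game is over."""
        world.apply(white.choose_action(world.state))
        white.learn(world.state)        # White learns from his own afterstate
        if world.is_final():
            black.learn(world.state)    # Black also learns from White's winning afterstate
            return True
        world.apply(black.choose_action(world.state))
        black.learn(world.state)        # Black learns from his own afterstate
        if world.is_final():
            white.learn(world.state)    # White also learns from Black's winning afterstate
            return True
        return False

Under Option A the two cross-learning calls are simply absent, so a player never receives a learning signal from the terminal state its opponent created.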

Typically, with TD learning, an agent would have 3 functions:

  • start(observation) → action
  • step(observation, reward) → action
  • finish(reward)

Action selection is combined with learning at each step, and a bit more learning also occurs when the game ends.
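
A minimal sketch of that three-function interface, assuming a tabular value function with accumulating eligibility traces in place of the neural network (with Sutton's TD/Backprop network the trace would be kept per weight rather than per state); all class and method internals here are illustrative:

    class TDAgent:
        """Tabular TD(lambda) agent exposing start/step/finish."""

        def __init__(self, alpha=0.1, lam=0.7, gamma=1.0):
            self.alpha, self.lam, self.gamma = alpha, lam, gamma
            self.V = {}       # state -> estimated value
            self.z = {}       # state -> eligibility trace
            self.prev = None  # last state we acted from

        def start(self, observation):
            self.z = {}                  # fresh traces each game
            self.prev = observation
            return self._policy(observation)

        def step(self, observation, reward):
            # Learn from the transition prev -> observation, then act.
            self._update(reward, self.V.get(observation, 0.0))
            self.prev = observation
            return self._policy(observation)

        def finish(self, reward):
            # Terminal update: a finished game has value 0 by definition.
            self._update(reward, 0.0)

        def _update(self, reward, next_value):
            delta = reward + self.gamma * next_value - self.V.get(self.prev, 0.0)
            self.z[self.prev] = self.z.get(self.prev, 0.0) + 1.0  # accumulate trace
            for s, trace in self.z.items():
                self.V[s] = self.V.get(s, 0.0) + self.alpha * delta * trace
                self.z[s] = self.gamma * self.lam * trace          # decay traces

        def _policy(self, observation):
            # Game-specific afterstate search goes here (see the
            # choose_action sketch earlier in the question).
            raise NotImplementedError

With this interface the terminal update in finish(reward) plays the role of option B's extra LEARN call: when the game ends, both agents get a final learning step, winner and loser alike.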


 