
Java maze solving and reinforcement learning

I'm writing code to simulate the actions of both Theseus and the Minotaur in this logic game: http://www.logicmazes.com/theseus.html

For each maze I provide it with the positions of the maze and which positions are available, e.g. from position 0 the next states are 1, 2, or staying on 0. I run a QLearning instantiation which calculates the best path for Theseus to escape the maze assuming no Minotaur. Then the Minotaur is introduced. Theseus makes his first move towards the exit and is inevitably caught, resulting in reweighting of the best path. Using maze 3 in the game as a test, this approach led to Theseus moving up and down on the middle line indefinitely, as these were the only moves that didn't get him killed.
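Purely as an illustration of that encoding, the legal moves could be stored in an adjacency array indexed by position, something like the snippet below. The name `actions` matches the code later in the question, but the contents are made up for a small hypothetical maze, not taken from it.

int[][] actions = {
    {0, 1, 2},   // from position 0: stay on 0, or move to 1 or 2
    {1, 0, 3},   // from position 1: stay, or move to 0 or 3
    {2, 0},      // from position 2: stay, or move back to 0
    {3, 1}       // from position 3: stay, or move back to 1
};
// actions[p] lists every position reachable from p in a single move.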

As per a suggestion received here within the last few days, I adjusted my code to consider the state to be both the position of Theseus and the position of the Minotaur at a given time. When Theseus moves, the resulting state is added to a list of "visited states". By comparing the state resulting from the suggested move to the list of visited states, I can ensure that Theseus does not make a move that would result in a previous state.
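One way to make that comparison cheap and reliable is to give the state class value semantics (equals and hashCode) and keep visited states in a HashSet. Below is a minimal sketch of that idea; the State used in the question's code has a no-arg constructor, setAttributes and getters, so this class is only a hypothetical illustration, not the asker's actual one.

import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Minimal value-type state: Theseus's position plus the Minotaur's position.
final class VisitedState {
    final int theseusPosition;
    final int minotaurPosition;

    VisitedState(int theseusPosition, int minotaurPosition) {
        this.theseusPosition = theseusPosition;
        this.minotaurPosition = minotaurPosition;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof VisitedState)) return false;
        VisitedState s = (VisitedState) o;
        return theseusPosition == s.theseusPosition
            && minotaurPosition == s.minotaurPosition;
    }

    @Override
    public int hashCode() {
        return Objects.hash(theseusPosition, minotaurPosition);
    }
}

class VisitedStatesDemo {
    public static void main(String[] args) {
        Set<VisitedState> visited = new HashSet<>();
        visited.add(new VisitedState(5, 1));                          // e.g. (t5, m1)
        System.out.println(visited.contains(new VisitedState(5, 1))); // true: would be rejected
        System.out.println(visited.contains(new VisitedState(3, 1))); // false: allowed
    }
}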

The problem is I need to be able to revisit states in some cases. E.g. using maze 3 as an example, with the Minotaur moving twice for every Theseus move: Theseus moves 4 -> 5, state (t5, m1) is added. The Minotaur moves 1 -> 5. Theseus is caught, reset. 4 -> 5 is a bad move, so Theseus moves 4 -> 3, and the Minotaur catches him on his turn. Now both (t5, m1) and (t3, m1) are on the visited list.

What happens is that all possible states reachable from the initial state get added to the don't-visit list, meaning that my code loops indefinitely and cannot provide a solution.

public void move()
{
    int randomness = 10;                  // percentage chance of taking a random action
    State tempState = new State();
    boolean rejectMove = true;
    int keepCurrent = currentPosition;    // remember both positions so a rejected
    int keepMinotaur = minotaurPosition;  // move can be rolled back

    previousPosition = currentPosition;
    do
    {
        // Roll back to the saved positions before trying another move.
        minotaurPosition = keepMinotaur;
        currentPosition = keepCurrent;
        rejectMove = false;

        // Cap the visited-state list so it cannot grow without bound.
        if (states.size() > 10)
        {
            states.clear();
        }

        // If the policy would walk straight into the Minotaur, force a random move.
        if (this.policy(currentPosition) == this.minotaurPosition)
        {
            randomness = 100;
        }

        if (Math.random() * 100 <= randomness)
        {
            System.out.println("Random move");
            int[] actionsFromState = actions[currentPosition];
            int max = actionsFromState.length;
            Random r = new Random();
            int s = r.nextInt(max);

            previousPosition = currentPosition;
            currentPosition = actionsFromState[s];
        }
        else
        {
            previousPosition = currentPosition;
            currentPosition = policy(currentPosition);
        }

        tempState.setAttributes(minotaurPosition, currentPosition);
        randomness = 10;

        // Reject the move if the resulting (Theseus, Minotaur) state has been seen before.
        for (int i = 0; i < states.size(); i++)
        {
            if (states.get(i).getMinotaurPosition() == tempState.getMinotaurPosition()
                    && states.get(i).getTheseusPosition() == tempState.getTheseusPosition())
            {
                rejectMove = true;
                changeReward(100);
            }
        }

    }
    while (rejectMove);

    states.add(tempState);
}

Above is Theseus's move method, showing how it occasionally suggests a random move.

The problem here is a discrepancy between the "never visit a state you've previously been in" approach and your "reinforcement learning" approach. When I recommended the "never visit a state you've previously been in" approach, I was making the assumption that you were using backtracking: once Theseus got caught, you would unwind the stack to the last place where he made an unforced choice, and then try a different option. (That is, I assumed you were using a simple depth-first search of the state space.) In that sort of approach, there's never any reason to visit a state you've previously visited.
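To make that backtracking assumption concrete, here is a minimal depth-first-search sketch over (Theseus, Minotaur) states. It assumes the State class has working equals/hashCode, and successors(), isExit() and isCaught() are hypothetical helpers standing in for the maze rules; none of this is taken from the question's code.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain depth-first search with backtracking: a caught state is a dead end,
// and a state is never explored twice, so no run can loop forever.
class DepthFirstSolver {
    private final Set<State> visited = new HashSet<>();

    boolean solve(State state, List<State> path) {
        if (isCaught(state)) return false;         // dead end: unwind to the last choice
        if (isExit(state)) { path.add(0, state); return true; }
        if (!visited.add(state)) return false;     // already explored, no reason to revisit

        for (State next : successors(state)) {     // try each unforced choice in turn
            if (solve(next, path)) {
                path.add(0, state);                // a choice worked: record it and stop
                return true;
            }
        }
        return false;                              // every choice failed: backtrack further
    }

    // Hypothetical helpers; a real version would encode the maze's walls and the
    // Minotaur's two moves per Theseus move.
    private boolean isCaught(State s) { return s.getTheseusPosition() == s.getMinotaurPosition(); }
    private boolean isExit(State s)   { return s.getTheseusPosition() == EXIT; }
    private List<State> successors(State s) { return new ArrayList<>(); }
    private static final int EXIT = 13;            // illustrative exit square
}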

For your "reinforcement learning" approach, where you're completely resetting the maze every time Theseus gets caught, you'll need to change that. I suppose you can change the "never visit a state you've previously been in" rule to a two-pronged rule:

  • never visit a state you've been in during this run of the maze. (This is to prevent infinite loops.)
  • disprefer visiting a state you've been in during a run of the maze where Theseus got caught. (This is the "learning" part: if a choice has previously worked out poorly, it should be made less often. See the sketch after this list.)
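Here is a minimal sketch of that two-pronged rule. Everything in it is illustrative: visitedThisRun, penalty and the chooser class are made-up names, and it again assumes a state class with working equals/hashCode.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TwoProngedChooser {
    private final Set<State> visitedThisRun = new HashSet<>();   // prong 1: hard ban within one run
    private final Map<State, Double> penalty = new HashMap<>();  // prong 2: soft penalty across runs

    // Pick the candidate with the lowest learned penalty, never revisiting a state from this run.
    State choose(List<State> candidates) {
        State best = null;
        double bestScore = Double.POSITIVE_INFINITY;
        for (State s : candidates) {
            if (visitedThisRun.contains(s)) continue;            // never revisit within this run
            double score = penalty.getOrDefault(s, 0.0);         // disprefer states from losing runs
            if (score < bestScore) {
                bestScore = score;
                best = s;
            }
        }
        if (best != null) visitedThisRun.add(best);
        return best;                                             // null means every candidate is banned
    }

    // Call when Theseus gets caught: penalize this run's states, then reset for the next run.
    void endRunCaught() {
        for (State s : visitedThisRun) {
            penalty.merge(s, 1.0, Double::sum);
        }
        visitedThisRun.clear();
    }

    // Call when Theseus escapes: just reset the per-run set.
    void endRunEscaped() {
        visitedThisRun.clear();
    }
}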

For what it's worth, the simplest way to solve this problem optimally is to use alpha-beta pruning, which is a search algorithm for deterministic two-player games (like tic-tac-toe, checkers, chess). Here's a summary of how to implement it for your case (a sketch of both steps follows the list):

  1. Create a class that represents the current state of the game, which should include: Theseus's position, the Minotaur's position, and whose turn it is. Say you call this class GameState.

  2. Create a heuristic function that takes an instance of GameState as a parameter and returns a double that's calculated as follows:

    • Let Dt be the Manhattan distance (number of squares) that Theseus is from the exit.

    • Let Dm be the Manhattan distance (number of squares) that the Minotaur is from Theseus.

    • Let T be 1 if it's Theseus's turn and -1 if it's the Minotaur's.

    • If Dm is not zero and Dt is not zero, return Dm + (Dt/2) * T

    • If Dm is zero, return -Infinity * T

    • If Dt is zero, return Infinity * T

The heuristic function above returns the value that Wikipedia refers to as "the heuristic value of node" for a given GameState (node) in the pseudocode of the algorithm.
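As a rough sketch of steps 1 and 2, assuming board coordinates are available for the Manhattan distances (the field and method names below are illustrative, not prescribed by the answer):

// Step 1: the game state — both positions plus whose turn it is.
final class GameState {
    final int theseusRow, theseusCol;
    final int minotaurRow, minotaurCol;
    final boolean theseusTurn;

    GameState(int tr, int tc, int mr, int mc, boolean theseusTurn) {
        this.theseusRow = tr;  this.theseusCol = tc;
        this.minotaurRow = mr; this.minotaurCol = mc;
        this.theseusTurn = theseusTurn;
    }
}

// Step 2: the heuristic described above.
final class Heuristic {
    private final int exitRow, exitCol;    // exit coordinates, assumed known

    Heuristic(int exitRow, int exitCol) {
        this.exitRow = exitRow;
        this.exitCol = exitCol;
    }

    double evaluate(GameState g) {
        // Dt: Manhattan distance from Theseus to the exit.
        int dt = Math.abs(g.theseusRow - exitRow) + Math.abs(g.theseusCol - exitCol);
        // Dm: Manhattan distance from the Minotaur to Theseus.
        int dm = Math.abs(g.minotaurRow - g.theseusRow) + Math.abs(g.minotaurCol - g.theseusCol);
        // T: +1 on Theseus's turn, -1 on the Minotaur's.
        int t = g.theseusTurn ? 1 : -1;

        if (dm == 0) return Double.NEGATIVE_INFINITY * t;   // Theseus is caught
        if (dt == 0) return Double.POSITIVE_INFINITY * t;   // Theseus has reached the exit
        return dm + (dt / 2.0) * t;
    }
}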

You now have all the elements to code it in Java.
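And, as one possible shape for the search itself, here is a minimal alpha-beta sketch over the GameState above. successors() and isTerminal() are hypothetical helpers that would generate legal moves and detect the end of the game according to the maze's rules; they are stubbed out here.

import java.util.ArrayList;
import java.util.List;

// Depth-limited alpha-beta: Theseus maximizes the heuristic, the Minotaur minimizes it.
final class AlphaBeta {
    private final Heuristic heuristic;

    AlphaBeta(Heuristic heuristic) {
        this.heuristic = heuristic;
    }

    double search(GameState g, int depth, double alpha, double beta) {
        if (depth == 0 || isTerminal(g)) {
            return heuristic.evaluate(g);
        }
        if (g.theseusTurn) {                                   // maximizing player
            double value = Double.NEGATIVE_INFINITY;
            for (GameState child : successors(g)) {
                value = Math.max(value, search(child, depth - 1, alpha, beta));
                alpha = Math.max(alpha, value);
                if (alpha >= beta) break;                      // beta cutoff
            }
            return value;
        } else {                                               // minimizing player (the Minotaur)
            double value = Double.POSITIVE_INFINITY;
            for (GameState child : successors(g)) {
                value = Math.min(value, search(child, depth - 1, alpha, beta));
                beta = Math.min(beta, value);
                if (beta <= alpha) break;                      // alpha cutoff
            }
            return value;
        }
    }

    // Hypothetical helpers: legal moves and terminal test come from the maze rules.
    private List<GameState> successors(GameState g) { return new ArrayList<>(); }
    private boolean isTerminal(GameState g) { return false; }
}

On his turn, Theseus would then pick the successor with the highest search value.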
