
Q-learning to learn minesweeping behavior

I am attempting to use Q-learning to learn minesweeping behavior on a discrete version of Mat Buckland's smart sweepers (the original is available here: http://www.ai-junkie.com/ann/evolved/nnt1.html ) for an assignment. The assignment limits us to 50 iterations of 2000 moves on a grid that is effectively 40x40, with the mines resetting and the agent being spawned in a random location each iteration.

I've attempted Q-learning with penalties for moving, rewards for sweeping mines, and penalties for not hitting a mine. The sweeper agent seems unable to learn how to sweep mines effectively within the 50 iterations: it learns that going to a specific cell is good, but once the mine there is gone it is no longer rewarded for visiting that cell and is instead penalized by the movement cost.

I wanted to try providing a reward only when all the mines were cleared, in an attempt to make the environment static (there would only be the two outcomes of "all mines collected" or "not all mines collected"), but I am struggling to implement this: with only 2000 moves per iteration and the ability to backtrack, the agent never manages to sweep all the mines within the limit, with or without rewards for collecting individual mines.

Another idea I had was to have an effectively separate Q matrix for each mine, so that once a mine is collected, the sweeper transitions to the next matrix, in which the collected mine is excluded from consideration.

Are there any better approaches that I can take with this, or perhaps more practical tweaks to my own approach that I can try?

A more explicit explanation of the rules (with a rough sketch after the list):

  • The map edges wrap around, so moving off the right edge of the map will cause the bot to appear on the left edge, and so on.
  • The sweeper bot can move up, down, left, or right from any map tile.
  • When the bot collides with a mine, the mine is considered swept and then removed.
  • The aim is for the bot to learn to sweep all mines on the map from any starting position.
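To make the rules concrete, they boil down to roughly the following (just an illustrative sketch, not my actual assignment code; GRID_SIZE, ACTIONS and step are placeholder names, and mines is assumed to be a set of (x, y) tuples):

    GRID_SIZE = 40                      # the effectively 40x40 grid from the assignment
    ACTIONS = {                         # the four legal moves as (dx, dy) offsets
        "up":    (0, -1),
        "down":  (0, 1),
        "left":  (-1, 0),
        "right": (1, 0),
    }

    def step(agent_pos, action, mines):
        """Apply one move: wrap around the map edges and sweep a mine on collision."""
        dx, dy = ACTIONS[action]
        x = (agent_pos[0] + dx) % GRID_SIZE   # falling off one edge re-appears on the other
        y = (agent_pos[1] + dy) % GRID_SIZE
        swept = (x, y) in mines
        if swept:
            mines.discard((x, y))             # a hit mine counts as swept and is removed
        return (x, y), swept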

Given that the sweeper can always see the nearest mine, this should be pretty easy. From your question I assume your only problem is finding a good reward function and representation for your agent state.

Defining a state

Absolute positions are rarely useful in a random environment, especially if the environment is infinite, as in your example (since the bot can drive over the borders and respawn on the other side). This means that the size of the environment isn't needed for the agent to operate (although we will actually need it to simulate the infinite space).

A reward function calculates its return value based on the current state of the agent compared to its previous state. But how do we define a state? Let's see what we actually need in order to operate the agent the way we want it to.

  1. The position of the agent.
  2. The position of the nearest mine.

That is all we need. Now, I said earlier that absolute positions are bad. This is because they make the Q table (you call it a Q matrix) static and very fragile to randomness. So let's try to completely eliminate absolute positions from the reward function and replace them with relative positions. Luckily, this is very simple in your case: instead of using the absolute positions, we use the relative position between the nearest mine and the agent.

Now we don't deal with coordinates anymore, but with vectors. Let's calculate the vector between our points: v = pos_mine - pos_agent . This vector gives us two very important pieces of information:

  1. the direction in which the nearest mine lies, and
  2. the distance to the nearest mine.

And these are all we need to make our agent operational. Therefore, an agent state can be defined as

State: Direction x Distance

where distance is a floating-point value and direction is either a float that describes the angle or a normalized vector.
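To make this concrete on your wrap-around grid, the state could be built roughly like this (only a sketch: relative_vector and encode_state are names I made up, and I discretise the direction to a sign pair and the distance to the Manhattan distance, since your grid is discrete anyway):

    def sign(x):
        """Return -1, 0 or 1 depending on the sign of x."""
        return (x > 0) - (x < 0)

    def relative_vector(agent_pos, mine_pos, grid_size=40):
        """Shortest wrapped offset from the agent to a mine on the toroidal grid."""
        def wrap(delta):
            # map delta into [-grid_size/2, grid_size/2) so we take the short way around
            return (delta + grid_size // 2) % grid_size - grid_size // 2
        return wrap(mine_pos[0] - agent_pos[0]), wrap(mine_pos[1] - agent_pos[1])

    def encode_state(agent_pos, mines, grid_size=40):
        """State = (direction, distance) to the nearest mine, both discretised."""
        vectors = [relative_vector(agent_pos, m, grid_size) for m in mines]
        dx, dy = min(vectors, key=lambda v: abs(v[0]) + abs(v[1]))
        direction = (sign(dx), sign(dy))      # one of 9 coarse directions
        distance = abs(dx) + abs(dy)          # Manhattan distance to that mine
        return direction, distance

Keeping the direction this coarse keeps the Q table small, which matters when you only have 50 iterations of 2000 moves.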

Defining a reward function

Given our newly defined state, the only thing we care about in our reward function is the distance. Since all we want is to move the agent towards mines, the distance is all that matters. Here are a few guesses as to how the reward function could work:

  1. If the agent sweeps a mine (distance == 0), return a huge reward (e.g. 100).
  2. If the agent moves towards a mine (distance is shrinking), return a neutral (or small) reward (e.g. 0).
  3. If the agent moves away from a mine (distance is increasing), return a negative reward (e.g. -1).

Theoretically, since we penalize moving away from a mine, we don't even need rule 1 here.
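Translated directly into code, the three rules could look like this sketch (the reward values are placeholders to tune, and new_distance is assumed to be measured before the swept mine is removed, so that it is 0 on a sweep):

    SWEEP_REWARD = 100    # rule 1: the agent landed on a mine
    NEUTRAL      = 0      # rule 2: the agent moved towards the nearest mine
    PENALTY      = -1     # rule 3: the agent moved away from the nearest mine

    def reward(prev_distance, new_distance):
        """Reward based only on how the distance to the nearest mine changed."""
        if new_distance == 0:                 # the move ended on a mine
            return SWEEP_REWARD
        if new_distance < prev_distance:      # closed in on the nearest mine
            return NEUTRAL
        return PENALTY                        # moved away (or sideways)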

Conclusion

The only thing left is determining a good learning rate and discount so that your agent performs well after 50 iterations. But, given the simplicity of the environment, this shouldn't even matter that much. Experiment.
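For completeness, the learning rate and discount enter through the standard tabular Q-learning update, roughly as below (ALPHA, GAMMA and the epsilon-greedy exploration constant are just starting points of my own, not prescribed values):

    import random
    from collections import defaultdict

    ALPHA   = 0.1    # learning rate: how strongly each sample overwrites the old estimate
    GAMMA   = 0.9    # discount: how much future reward matters
    EPSILON = 0.1    # exploration rate for an epsilon-greedy policy

    Q = defaultdict(float)                   # (state, action) -> value, 0.0 by default
    MOVES = ["up", "down", "left", "right"]

    def choose_action(state):
        """Epsilon-greedy action selection over the current Q estimates."""
        if random.random() < EPSILON:
            return random.choice(MOVES)
        return max(MOVES, key=lambda a: Q[(state, a)])

    def q_update(state, action, r, next_state):
        """Standard tabular Q-learning update."""
        best_next = max(Q[(next_state, a)] for a in MOVES)
        Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])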
