
Problems with implementing approximate (feature-based) Q-learning

I am new to reinforcement learning. I recently learned about approximate Q-learning, or feature-based Q-learning, in which you describe states by features to save space. I have tried to implement this in a simple grid game. Here, the agent is supposed to learn not to go into a firepit (signaled by an f) and instead to eat up as many dots as possible. Here is the grid used:

...A
.ff
.ff
...f

Here A signals the agent's starting location. When implementing this, I set up two features. One was 1/((distance to closest dot)^2), and the other was (distance to firepit) + 1. When the agent enters a firepit, the program returns a reward of -100. If it goes to a non-firepit position that was already visited (so there is no dot left to eat), the reward is -50. If it goes to an unvisited dot, the reward is +500. In the above grid, no matter what the initial weights are, the program never learns the correct weight values. Specifically, in the output, the first training session gains a score (how many dots it ate) of 3, but for all other training sessions the score is just 1, and the weights converge to an incorrect value of -125 for weight 1 (distance to firepit) and 25 for weight 2 (distance to unvisited dot). Is there something specifically wrong with my code, or is my understanding of approximate Q-learning incorrect?
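
For reference, the Q-value itself is just a weighted sum of those two features. A simplified sketch of how I compute it (the names and signature here are illustrative, not the exact ones from the linked program):

#include <cmath>

// Simplified, illustrative sketch of the linear (feature-based) Q-value:
//   Q(s, a) = wDot * fDot(s, a) + wFire * fFire(s, a)
// dotDist and fireDist are the distances measured in the state reached by
// taking action a; wDot and wFire are the weights being learned.
double qValue(double dotDist, double fireDist, double wDot, double wFire) {
    double fDot  = 1.0 / std::pow(dotDist, 2.0); // 1 / (distance to closest dot)^2
    double fFire = fireDist + 1.0;               // (distance to firepit) + 1
    return wDot * fDot + wFire * fFire;
}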

I have tried playing around with the rewards that the environment gives and with the initial weights, but neither has fixed the problem. Here is the link to the entire program: https://repl.it/repls/WrongCheeryInterface

Here is what is going on in the main loop:

while(points != NUMPOINTS){
  bool playerDied = false;

  if(!start){
    if(!atFirepit()){
      r = 0;
      if(visited[player.x][player.y] == 0){
        points += 1;
        r += 500;
      }else{
        r += -50;
      }
    }else{
      playerDied = true;
      r = -100;
    }
  }

  //Update visited
  visited[player.x][player.y] = 1;

  if(!start){
    //This is based off the q learning update formula
    pairPoint qAndA = getMaxQAndAction();
    double maxQValue = qAndA.q;
    double sample = r;
    if(!playerDied && points != NUMPOINTS)
      sample = r + (gamma2 * maxQValue);
    double diff = sample - qVal;
    updateWeights(player, diff);
  }

  //Checking end game condition
  if(playerDied || points == NUMPOINTS) break;

  pairPoint qAndA = getMaxQAndAction();
  qVal = qAndA.q;
  int bestAction = qAndA.a;

  //Update player and q value
  player.x += dx[bestAction];
  player.y += dy[bestAction];

  start = false;
}
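
For context, updateWeights is meant to apply the standard approximate Q-learning weight update, w_i <- w_i + alpha * diff * f_i(s, a). A simplified sketch of that update (the real function in the linked program takes the player position rather than the raw feature values):

// Sketch of the linear Q-learning weight update:
//   w_i <- w_i + alpha * (sample - Q(s, a)) * f_i(s, a)
// diff is (sample - qVal) computed in the loop above; fDot and fFire are the
// feature values of the state-action pair being updated.
void updateWeightsSketch(double diff, double fDot, double fFire,
                         double &wDot, double &wFire, double alpha) {
    wDot  += alpha * diff * fDot;
    wFire += alpha * diff * fFire;
}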

I would expect both weights to be positive, but one of them (the one based on the distance to the firepit) is negative.

I also expected the program to learn over time that it is bad to enter a firepit and also bad, but not as bad, to revisit a position that no longer has a dot.

Probably not the answer you want to hear, but:

  • Have you tried implementing the simpler tabular Q-learning before approximate Q-learning? In your setting, with only a few states and actions, it will work perfectly. If you are learning, I strongly recommend starting with the simpler cases in order to get a better understanding/intuition of how reinforcement learning works. (See the tabular sketch after this list.)

  • Do you know the implications of using approximators instead of learning the exact Q function? In some cases, due to the complexity of the problem (e.g., when the state space is continuous), you have to approximate the Q function (or the policy, depending on the algorithm), but this may introduce convergence problems. Additionally, in your case you are trying to hand-pick features, which usually requires deep knowledge of the problem (i.e., the environment) and of the learning algorithm.

  • Do you understand the meaning of the hyperparameters alpha and gamma? You cannot choose them arbitrarily; they are sometimes critical to obtaining the expected results, depending heavily on the problem and the learning algorithm. In your case, looking at the convergence curve of your weights, it is pretty clear that you are using a value of alpha that is too high. As you pointed out, after the first training session your weights remain constant.
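
To illustrate the first and third points, here is a minimal tabular Q-learning update for a 4x4 grid like yours, with a deliberately small alpha. This is a sketch under my own assumptions about the state encoding, not a drop-in replacement for your program:

// Minimal tabular Q-learning sketch: one Q-value per (state, action) pair.
// State here is just the cell index (x * GRID_SIZE + y); for your dot-eating
// game the state would also have to encode which dots are still present.
const int GRID_SIZE = 4;
const int NUM_STATES = GRID_SIZE * GRID_SIZE, NUM_ACTIONS = 4;
double Q[NUM_STATES][NUM_ACTIONS] = {};   // Q-table initialised to 0
double alpha = 0.1, gammaDiscount = 0.9;  // small learning rate, discount < 1

void tabularUpdate(int s, int a, double r, int sNext, bool terminal) {
    // max over actions of Q(s', b); 0 if s' is terminal (firepit or all dots eaten)
    double maxNext = 0.0;
    if (!terminal) {
        maxNext = Q[sNext][0];
        for (int b = 1; b < NUM_ACTIONS; ++b)
            if (Q[sNext][b] > maxNext) maxNext = Q[sNext][b];
    }
    // Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a))
    Q[s][a] += alpha * (r + gammaDiscount * maxNext - Q[s][a]);
}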

Therefore, practical recommendations:

  • Be sure to solve your grid game using a tabular Q-learning algorithm before trying more complex things.

  • Experiment with different values of alpha, gamma, and the rewards (for instance with a simple grid search, sketched after this list).

  • Read more in depth about approximate RL. A very good and accessible book (starting from zero knowledge) is the classic Sutton and Barto book, Reinforcement Learning: An Introduction, which you can obtain for free and which was updated in 2018.
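
And to make the second recommendation concrete, a crude way to try several hyperparameter combinations. This is purely illustrative; runTraining is a stand-in for whatever your training routine returns as the final score:

#include <cstdio>

// Placeholder for your own training run: train with the given hyperparameters
// and return the score (dots eaten) of the final greedy policy.
double runTraining(double alpha, double gamma) {
    return 0.0; // replace with your actual training loop
}

int main() {
    const double alphas[] = {0.01, 0.05, 0.1, 0.3};
    const double gammas[] = {0.5, 0.8, 0.9, 0.99};
    for (double a : alphas)
        for (double g : gammas)
            std::printf("alpha=%.2f  gamma=%.2f  score=%.1f\n",
                        a, g, runTraining(a, g));
    return 0;
}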
