Problems with implementing approximate (feature-based) Q-learning

I am new to reinforcement learning. I recently learned about approximate Q-learning, or feature-based Q-learning, in which you describe states by features to save space. I have tried to implement this in a simple grid game. Here, the agent is supposed to learn not to step into a firepit (signaled by an f) and instead to eat as many dots as possible. Here is the grid used:

...A
.ff
.ff
...f

Here A marks the agent's starting location. When implementing this, I set up two features. One was 1/((distance to closest dot)^2), and the other was (distance to firepit) + 1. When the agent enters a firepit, the program returns a reward of -100. If it moves to a non-firepit position that was already visited (and thus there is no dot left to eat), the reward is -50. If it moves to an unvisited dot, the reward is +500. In the above grid, no matter what the initial weights are, the program never learns the correct weight values. Specifically, in the output, the first training session gains a score (the number of dots eaten) of 3, but for all other training sessions the score is just 1, and the weights converge to an incorrect value of -125 for weight 1 (distance to firepit) and 25 for weight 2 (distance to unvisited dot). Is there something specifically wrong with my code, or is my understanding of approximate Q-learning incorrect?

I have tried playing around with the rewards that the environment gives and also with the initial weights. None of these have fixed the problem. Here is the link to the entire program: https://repl.it/repls/WrongCheeryInterface

Here is what is going on in the main loop:

while(points != NUMPOINTS){
  bool playerDied = false;
  if(!start){
    if(!atFirepit()){
      r = 0;
      if(visited[player.x][player.y] == 0){
        points += 1;
        r += 500;
      }else{
        r += -50;
      }
    }else{
      playerDied = true;
      r = -100;
    }
  }

  //Update visited
  visited[player.x][player.y] = 1;

  if(!start){
    //This is based off the q learning update formula
    pairPoint qAndA = getMaxQAndAction();
    double maxQValue = qAndA.q;
    double sample = r;
    if(!playerDied && points != NUMPOINTS)
      sample = r + (gamma2 * maxQValue);
    double diff = sample - qVal;
    updateWeights(player, diff);
  }

  // checking end game condition
  if(playerDied || points == NUMPOINTS) break;

  pairPoint qAndA = getMaxQAndAction();
  qVal = qAndA.q;
  int bestAction = qAndA.a;

  //update player and q value
  player.x += dx[bestAction];
  player.y += dy[bestAction];

  start = false;
}
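For reference, in linear approximate Q-learning the role of helpers like getMaxQAndAction and updateWeights is to compute Q(s,a) = w1*f1(s) + w2*f2(s) and to apply w_i += alpha * diff * f_i(s). Below is a minimal sketch of that update rule; the feature implementations, alpha, and the transition in main are illustrative placeholders, not the actual code from the repl.

#include <cmath>
#include <cstdio>
#include <cstdlib>

// Illustrative stand-ins for the two features described above:
// f1 = 1 / (distance to closest unvisited dot)^2, f2 = (distance to firepit) + 1.
// A real implementation would search the grid; these placeholders just return
// plausible values so the update rule can be shown in isolation.
double closestDotFeature(int x, int y) { return 1.0 / std::pow(1.0 + x + y, 2); }
double firepitFeature(int x, int y)    { return 1.0 + std::abs(1 - x) + std::abs(1 - y); }

double w1 = 0.0, w2 = 0.0;   // the two weights being learned
const double alpha = 0.01;   // learning rate, deliberately small

// Linear approximation: Q(s) = w1 * f1(s) + w2 * f2(s)
double qValue(int x, int y) {
  return w1 * closestDotFeature(x, y) + w2 * firepitFeature(x, y);
}

// Approximate Q-learning weight update: w_i += alpha * diff * f_i(s),
// where diff = (r + gamma * max_a' Q(s', a')) - Q(s, a).
void updateWeights(int x, int y, double diff) {
  w1 += alpha * diff * closestDotFeature(x, y);
  w2 += alpha * diff * firepitFeature(x, y);
}

int main() {
  // one fake transition: reward +500, next-state value taken as 0 for brevity
  double diff = (500.0 + 0.9 * 0.0) - qValue(3, 0);
  updateWeights(3, 0, diff);
  std::printf("w1 = %f, w2 = %f\n", w1, w2);
}

The key point of the linear form is that each weight moves in proportion to its own feature value, scaled by alpha and the TD error.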

I would expect both weights to remain positive, but one of them (the one based on the distance to the firepit) becomes negative.

I also expected the program to learn over time that it is bad to enter a firepit and also bad, but not as bad, to revisit an already visited position.

Probably not the answer you want to hear, but:

  • Have you tried implementing the simpler tabular Q-learning before approximate Q-learning? In your setting, with few states and actions, it will work perfectly. If you are learning, I strongly recommend you start with the simpler cases in order to get a better understanding/intuition of how reinforcement learning works. (A minimal sketch of the tabular version is included after this list.)

  • Do you know the implications of using approximators instead of learning the exact Q function? In some cases, due to the complexity of the problem (e.g., when the state space is continuous) you need to approximate the Q function (or the policy, depending on the algorithm), but this may introduce convergence problems. Additionally, in your case, you are trying to hand-pick features, which usually requires deep knowledge of the problem (i.e., the environment) and of the learning algorithm.

  • Do you understand the meaning of the hyperparameters alpha and gamma? You cannot choose them arbitrarily. Sometimes they are critical to obtaining the expected results, not always, depending heavily on the problem and the learning algorithm. In your case, looking at the convergence curve of your weights, it is pretty clear that you are using a value of alpha that is too high. As you pointed out, after the first training session your weights remain constant.
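To make the first point concrete, here is a minimal tabular Q-learning sketch for a grid of roughly this size. It is only an illustration: the start cell, the firepit cells, the rewards and the hyperparameter values below are assumptions for the example (the dot-eating reward is left out for brevity), not taken from the linked repl.

#include <algorithm>
#include <cstdio>
#include <random>

// Minimal tabular Q-learning on a 4x4 grid with 4 actions (up, down, left, right).
// One Q entry per (x, y, action); alpha and gamma are the learning rate and
// discount factor discussed above. All values here are illustrative.
int main() {
  const int W = 4, H = 4, A = 4;
  const double alpha = 0.1, gamma = 0.9, epsilon = 0.1;
  const int dx[A] = {0, 0, -1, 1}, dy[A] = {-1, 1, 0, 0};

  double Q[W][H][A] = {};
  std::mt19937 rng(42);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  std::uniform_int_distribution<int> randAction(0, A - 1);

  for (int episode = 0; episode < 1000; ++episode) {
    int x = 3, y = 0;                                   // assumed start cell
    for (int step = 0; step < 50; ++step) {
      // epsilon-greedy action selection
      int a = randAction(rng);
      if (unif(rng) > epsilon)
        a = static_cast<int>(std::max_element(Q[x][y], Q[x][y] + A) - Q[x][y]);

      int nx = std::clamp(x + dx[a], 0, W - 1);
      int ny = std::clamp(y + dy[a], 0, H - 1);

      // assumed reward: -100 for stepping into the central firepit block, -1 otherwise
      bool firepit = (nx >= 1 && nx <= 2 && ny >= 1 && ny <= 2);
      double r = firepit ? -100.0 : -1.0;

      // tabular update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
      double maxNext = *std::max_element(Q[nx][ny], Q[nx][ny] + A);
      Q[x][y][a] += alpha * (r + gamma * maxNext - Q[x][y][a]);

      if (firepit) break;                               // episode ends in a firepit
      x = nx;
      y = ny;
    }
  }
  std::printf("Q at the start cell, action 0: %f\n", Q[3][0][0]);
}

Once a tabular version converges on the full game (including the dot rewards), its Q-table also gives you a sanity check for what the approximate version should be learning.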

Therefore, practical recommendations:

  • Be sure you can solve your grid game using a tabular Q-learning algorithm before trying more complex things.

  • Experiment with different values of alpha, gamma and the rewards.

  • Read more in depth about approximate RL. A very good and accessible book (starting from zero knowledge) is the classic Sutton and Barto book, Reinforcement Learning: An Introduction, which you can obtain for free and which was updated in 2018.
