Free Energy Reinforcement Learning Implementation
I've been trying to implement the algorithm described here, and then test it on the "large action task" described in the same paper.
Overview of the algorithm:
In brief, the algorithm uses an RBM of the form shown below to solve reinforcement learning problems. It changes the RBM's weights so that the negative free energy of a network configuration matches the reward signal given for that state-action pair.
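As a point of reference, the free energy of one [state action] configuration can be computed in closed form. The sketch below assumes a standard RBM with binary hidden units and visible biases, which may differ from the paper's exact parameterisation and from the softmax hidden units used in my code further down. The names `rbm_free_energy`, `W`, `hidbias`, and `visbias` are illustrative only.

```matlab
% Free energy of a binary-hidden-unit RBM for one visible vector v = [state action].
% Assumed shapes: v is 1 x numvis, W is numvis x numhid,
% hidbias is 1 x numhid, visbias is 1 x numvis.
softplus = @(x) max(x,0) + log1p(exp(-abs(x)));   % numerically stable log(1+exp(x))
rbm_free_energy = @(v,W,hidbias,visbias) ...
    -v*visbias' - sum(softplus(v*W + hidbias));

% Tiny example with made-up numbers: 3 visible units, 2 hidden units.
v = [1 0 1];
W = zeros(3,2);
F = rbm_free_energy(v, W, zeros(1,2), zeros(1,3));
Q = -F;   % the value estimate is the negative free energy
```

With all weights and biases at zero, each hidden unit contributes log 2, so `Q` here is 2*log(2) regardless of `v`.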
To select an action, the algorithm performs Gibbs sampling while holding the state variables fixed. With enough time, this produces the action with the lowest free energy, and thus the highest reward for the given state.
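The clamped sampling described above can be sketched roughly as follows. This is a minimal illustration assuming binary hidden units and a temperature applied by dividing the unit inputs by `T`; the sizes, the random weights, and every variable name are placeholders rather than anything taken from the paper.

```matlab
% Gibbs sampling over action bits with the state bits held fixed.
% Hypothetical sizes and random weights, for illustration only.
numdims = 12; numactiondims = 40; numhid = 13;
W = 0.1*randn(numdims+numactiondims, numhid);
hidbias = zeros(1, numhid);
visbias = zeros(1, numdims+numactiondims);

state  = double(rand(1, numdims) > 0.5);        % clamped, never resampled
action = double(rand(1, numactiondims) > 0.5);  % random initial action
T = 1.0;                                        % assumed temperature handling
numsteps = 100;
sigmoid = @(x) 1./(1+exp(-x));
actidx = numdims+1 : numdims+numactiondims;     % rows of W belonging to action bits

for step = 1:numsteps
    v = [state action];
    % Sample hidden units given the full visible vector
    hidprobs = sigmoid((v*W + hidbias)/T);
    hid = double(hidprobs > rand(1, numhid));
    % Resample only the action bits given the hidden units
    actprobs = sigmoid((hid*W(actidx,:)' + visbias(actidx))/T);
    action = double(actprobs > rand(1, numactiondims));
end
```

Lowering `T` over training makes the sampled action concentrate on low free energy configurations, which is why the temperature schedule matters for the final reward.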
Overview of the large action task:
Overview of the authors' guidelines for implementation:
A restricted Boltzmann machine with 13 hidden variables was trained on an instantiation of the large action task with a 12-bit state space and a 40-bit action space. Thirteen key states were randomly selected. The network was run for 12 000 actions with a learning rate going from 0.1 to 0.01 and temperature going from 1.0 to 0.1 exponentially over the course of training. Each iteration was initialized with a random state. Each action selection consisted of 100 iterations of Gibbs sampling.
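The guidelines do not spell out the exact form of the exponential schedules. One plausible reading, sketched here as an assumption, is a geometric interpolation between the start and end values over the 12 000 actions; the variable names are illustrative.

```matlab
% Exponential (geometric) annealing of learning rate and temperature.
% Assumption: value(t) = start * (stop/start)^((t-1)/(numiters-1)).
numiters = 12000;
t = 1:numiters;
frac = (t-1)/(numiters-1);           % runs from 0 to 1 over training

epsilon = 0.1 * (0.01/0.1).^frac;    % learning rate: 0.1 down to 0.01
temp    = 1.0 * (0.1/1.0).^frac;     % temperature:   1.0 down to 0.1
```

Inside the training loop, `epsilon(batch)` and `temp(batch)` would then replace fixed values of the learning rate and temperature.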
Important omitted details:
My implementation:
I initially assumed the authors used no mechanisms other than those described in the guidelines, so I tried training the network without bias units. This led to near-chance performance, and was my first clue that some of the mechanisms used must have been deemed 'obvious' by the authors and thus omitted.
I played around with the various omitted mechanisms mentioned above, and got my best results by using:
But even with all of these modifications, my performance on the task generally levelled off at an average reward of around 28 after 12000 iterations.
Code for each iteration:
%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
data = [batchdata(:,:,(batch)) rand(1,numactiondims)>.5];
poshidprobs = softmax(data*vishid + hidbiases);
%%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
hidstates = softmax_sample(poshidprobs);
%%%%%%%%% START ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if test
    % Test mode: pass 0 in place of temp
    [negaction poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,0);
else
    [negaction poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,temp);
end
% Sample binary action bits from the returned action probabilities
data(numdims+1:end) = negaction > rand(numcases,numactiondims);
if mod(batch,100) == 1
    % Periodic diagnostics
    disp(poshidprobs);
    disp(min(~xor(repmat(correct_action(:,(batch)),1,size(key_actions,2)), key_actions(:,:))));
end
posprods = data' * poshidprobs;
poshidact = poshidprobs;
posvisact = data;
%%%%%%%%% END OF ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if batch > 5
    momentum = .9;
else
    momentum = .5;
end
%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
F = calcF_softmax2(data,vishid,hidbiases,visbiases,temp);
Q = -F;
action = data(numdims+1:end);
reward = maxreward - sum(abs(correct_action(:,(batch))' - action));
% Log rewards and Q values separately for the first key action (A) and all others (B)
if isequal(correct_action(:,(batch)), correct_action(:,1))
    reward_dataA = [reward_dataA reward];
    Q_A = [Q_A Q];
else
    reward_dataB = [reward_dataB reward];
    Q_B = [Q_B Q];
end
reward_error = sum(reward - Q);
rewardsum = rewardsum + reward;
errsum = errsum + abs(reward_error);
error_data(ind) = reward_error;
reward_data(ind) = reward;
Q_data(ind) = Q;
vishidinc = momentum*vishidinc + ...
    epsilonw*((posprods*reward_error)/numcases - weightcost*vishid);
visbiasinc = momentum*visbiasinc + ...
    (epsilonvb/numcases)*(posvisact*reward_error - weightcost*visbiases);
hidbiasinc = momentum*hidbiasinc + ...
    (epsilonhb/numcases)*(poshidact*reward_error - weightcost*hidbiases);
vishid = vishid + vishidinc;
hidbiases = hidbiases + hidbiasinc;
visbiases = visbiases + visbiasinc;
%%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
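For comparison with the update block above, here is a stripped-down sanity check of the same kind of step in isolation. It is only a sketch under stated assumptions: binary hidden units instead of softmax, no visible biases, no momentum or weight decay, and a made-up fixed reward. It takes one gradient step on half the squared error between the reward and Q = -F, and the error should shrink when the step size is small.

```matlab
% One gradient step on 0.5*(reward - Q)^2 for a binary-hidden-unit RBM.
% All sizes, values, and the fixed reward are made up for illustration.
numvis = 52; numhid = 13;
W = 0.01*randn(numvis, numhid);
hidbias = zeros(1, numhid);
v = double(rand(1, numvis) > 0.5);     % one [state action] configuration
reward = 30;                           % hypothetical reward
epsilon = 0.001;                       % small step size

softplus = @(x) max(x,0) + log1p(exp(-abs(x)));
sigmoid  = @(x) 1./(1+exp(-x));
Qfun = @(W,hidbias) sum(softplus(v*W + hidbias));   % Q = -F

Q_before = Qfun(W, hidbias);
err = reward - Q_before;
hidprobs = sigmoid(v*W + hidbias);     % dQ/dW(i,k) = v(i)*hidprobs(k)
W = W + epsilon*err*(v'*hidprobs);
hidbias = hidbias + epsilon*err*hidprobs;
Q_after = Qfun(W, hidbias);
```

Running a check like this on a single fixed configuration is a cheap way to separate sign or scaling mistakes in the update from problems in action selection.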
What I'm asking for:
So, if any of you can get this algorithm to work properly (the authors claim to average ~40 reward after 12000 iterations), I'd be extremely grateful.
If my code appears to be doing something obviously wrong, then calling attention to that would also constitute a great answer.
I'm hoping that what the authors left out is indeed obvious to someone with more experience in energy-based learning than I have, in which case simply pointing out what needs to be included in a working implementation would be enough.