Free Energy Reinforcement Learning Implementation
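I have been trying to implement the algorithm described here, and then test it on the "large action task" described in the same paper.
Algorithm overview:
In brief, the algorithm uses a restricted Boltzmann machine (RBM) to solve reinforcement learning problems. It changes the weights so that the negative free energy of a network configuration approximates the reward signal given for that state-action pair.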
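For reference, the textbook free energy of an RBM with binary hidden units is written out below. This is only the standard form, not necessarily the variant used in the paper or in my `calcF_softmax2` (which uses softmax hidden units and a temperature). The notation is assumed here for illustration: $v$ is the concatenated state-action vector, $W$ the weights, $b$ the visible biases and $c$ the hidden biases.

```latex
F(v) = -\sum_i b_i v_i \;-\; \sum_j \log\!\Bigl(1 + \exp\Bigl(c_j + \sum_i v_i W_{ij}\Bigr)\Bigr),
\qquad
Q(s, a) \approx -F\bigl([s, a]\bigr)
```

Under this form, lowering the free energy of a state-action pair raises its estimated value, which matches the `Q = -F` line in the code further down.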
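To select an action, the algorithm performs Gibbs sampling while holding the state variables fixed. Given enough time, this produces the action with the lowest free energy, and thus the highest reward for the given state.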
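The selection step described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the paper's procedure and not my `choose_factored_action`: it uses binary (sigmoid) hidden units rather than softmax ones, random placeholder weights, and made-up variable values such as `T` and `nsteps`.

```matlab
% Sketch: choose an action by Gibbs sampling with the state bits clamped.
% Sizes follow the task description; weights are random placeholders.
numdims = 12;          % state bits (clamped)
numactiondims = 40;    % action bits (sampled)
numhid = 13;           % hidden units
T = 0.5;               % sampling temperature (assumed value, must be > 0)
nsteps = 100;          % Gibbs iterations per action selection

vishid    = 0.1 * randn(numdims + numactiondims, numhid);
hidbiases = zeros(1, numhid);
visbiases = zeros(1, numdims + numactiondims);
sigmoid   = @(x) 1 ./ (1 + exp(-x));
actidx    = numdims + 1 : numdims + numactiondims;

state  = double(rand(1, numdims) > 0.5);         % stays fixed throughout
action = double(rand(1, numactiondims) > 0.5);   % random initial action

for step = 1:nsteps
    v = [state action];
    % Sample binary hidden units given the full visible vector.
    hidprobs = sigmoid((v * vishid + hidbiases) / T);
    hid = double(hidprobs > rand(1, numhid));
    % Resample only the action units; the state units are never touched.
    actprobs = sigmoid((hid * vishid(actidx, :)' + visbiases(actidx)) / T);
    action = double(actprobs > rand(1, numactiondims));
end
```

Lower temperatures make the chain greedier, which is why the temperature is annealed over training.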
Large action task overview:
Authors' guidelines for implementation:
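A restricted Boltzmann machine with 13 hidden variables was trained on an instantiation of the large action task with a 12-bit state space and a 40-bit action space. Thirteen key states were randomly selected. The network was run for 12,000 actions, with the learning rate going from 0.1 to 0.01 and the temperature going from 1.0 to 0.1, both decaying exponentially over the course of training. Each iteration was initialized with a random state. Each action selection consisted of 100 iterations of Gibbs sampling.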
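One way to read "exponentially" in these guidelines is a geometric interpolation between the start and end values. The sketch below shows that reading; the exact schedule is not given in the guidelines, so this is an assumption, and the variable names are illustrative.

```matlab
% Sketch: geometric (exponential) annealing over 12,000 iterations.
numiters = 12000;
t = (0:numiters-1) / (numiters-1);    % training progress in [0, 1]
epsilon = 0.1 * (0.01 / 0.1) .^ t;    % learning rate: 0.1 -> 0.01
temp    = 1.0 * (0.1 / 1.0) .^ t;     % temperature:   1.0 -> 0.1
```

In the training loop, `epsilon(batch)` and `temp(batch)` would then supply the values for iteration `batch`.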
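Important omitted details:
My implementation:
I initially assumed the authors used no mechanisms other than those described in the guidelines, so I tried training the network without bias units. This led to near-chance performance, and was my first clue that some of the mechanisms used must have been deemed "obvious" by the authors and thus omitted.
I experimented with the various omitted mechanisms mentioned above, and got my best results by using:
But even with all of these modifications, my performance on the task generally came to an average reward of about 28 after 12,000 iterations.
Code for each iteration: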
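%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Visible vector: current state bits followed by a random initial action.
data = [batchdata(:,:,(batch)) rand(1,numactiondims)>.5];
% Hidden activations given the visible vector (softmax hidden units).
poshidprobs = softmax(data*vishid + hidbiases);
%%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Sample a hidden state to seed the Gibbs chain used for action selection.
hidstates = softmax_sample(poshidprobs);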
%%%%%%%%% START ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if test
[negaction poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,0);
else
[negaction poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,temp);
end
data(numdims+1:end) = negaction > rand(numcases,numactiondims);
if mod(batch,100) == 1
disp(poshidprobs);
disp(min(~xor(repmat(correct_action(:,(batch)),1,size(key_actions,2)), key_actions(:,:))));
end
posprods = data' * poshidprobs;
poshidact = poshidprobs;
posvisact = data;
%%%%%%%%% END OF ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Momentum schedule: 0.5 for the first five iterations, 0.9 afterwards.
if batch > 5
    momentum = .9;
else
    momentum = .5;
end
%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
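% The Q-value estimate is the negative free energy of the chosen state-action pair.
F = calcF_softmax2(data,vishid,hidbiases,visbiases,temp);
Q = -F;
action = data(numdims+1:end);
% Reward is maxreward minus the Hamming distance to the correct action.
reward = maxreward - sum(abs(correct_action(:,(batch))' - action));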
% Log rewards and Q-values separately for the first key action (A) and all others (B).
if isequal(correct_action(:,batch), correct_action(:,1))
reward_dataA = [reward_dataA reward];
Q_A = [Q_A Q];
else
reward_dataB = [reward_dataB reward];
Q_B = [Q_B Q];
end
reward_error = sum(reward - Q);
rewardsum = rewardsum + reward;
errsum = errsum + abs(reward_error);
error_data(ind) = reward_error;
reward_data(ind) = reward;
Q_data(ind) = Q;
% Step each parameter in proportion to the error (reward - Q), with momentum and L2 weight decay.
vishidinc = momentum*vishidinc + ...
epsilonw*( (posprods*reward_error)/numcases - weightcost*vishid);
visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*((posvisact)*reward_error - weightcost*visbiases);
hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*((poshidact)*reward_error - weightcost*hidbiases);
vishid = vishid + vishidinc;
hidbiases = hidbiases + hidbiasinc;
visbiases = visbiases + visbiasinc;
%%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
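As a sanity check on the update rule, the sketch below runs one plain gradient step on a single state-action vector and confirms that the error `reward - Q` shrinks. It rests on several assumptions that differ from the code above: binary hidden units, temperature 1, no momentum or weight decay, a hypothetical `calcF` helper standing in for `calcF_softmax2`, and a made-up reward value.

```matlab
% Sketch: one free-energy update step (binary hidden units, temperature 1).
numvis = 52;                               % 12 state bits + 40 action bits
numhid = 13;
vishid    = 0.01 * randn(numvis, numhid);  % placeholder initial weights
hidbiases = zeros(1, numhid);
visbiases = zeros(1, numvis);
sigmoid   = @(x) 1 ./ (1 + exp(-x));
% Hypothetical helper: F(v) = -v*b' - sum(log(1 + exp(v*W + c))).
calcF = @(v, W, c, b) -(v * b') - sum(log(1 + exp(v * W + c)));

data    = double(rand(1, numvis) > 0.5);   % one state-action vector
reward  = 30;                              % made-up reward for illustration
epsilon = 0.001;                           % small step size (assumed)

Q_before = -calcF(data, vishid, hidbiases, visbiases);
err      = reward - Q_before;
hidprobs = sigmoid(data * vishid + hidbiases);  % gradient of Q w.r.t. hidden biases

% Gradient ascent on Q, scaled by the error, moves Q toward the reward.
vishid    = vishid    + epsilon * err * (data' * hidprobs);
hidbiases = hidbiases + epsilon * err * hidprobs;
visbiases = visbiases + epsilon * err * data;

Q_after = -calcF(data, vishid, hidbiases, visbiases);
```

If the same check fails on a real implementation (the error grows after a small step), the sign of the update or the free-energy computation is a likely culprit.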
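What I'm asking for:
If anyone can get this algorithm to work properly (the authors claim an average reward of about 40 after 12,000 iterations), I would be extremely grateful.
If my code appears to be doing something obviously wrong, then pointing that out would also make a great answer.
I'm hoping that what the authors left out really is obvious to someone with more experience in energy-based learning than I have, in which case simply pointing out what needs to be included in a working implementation would be enough.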