
Free Energy Reinforcement Learning Implementation

I've been trying to implement the algorithm described here, and then test it on the "large action task" described in the same paper.

Overview of the algorithm:

[image: algorithm overview figure from the paper]

In brief, the algorithm uses an RBM of the form shown below to solve reinforcement learning problems by changing its weights such that the free energy of a network configuration equates to the reward signal given for that state-action pair.
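
For concreteness, here is a minimal sketch (my own variable names, binary hidden units, temperature 1; not the paper's notation) of the closed-form free energy that the algorithm uses as the negative action value, Q(s,a) = -F(s,a):

    % Minimal sketch: exact free energy of an RBM with binary hidden units (T = 1).
    % v is the concatenated [state action] row vector; W, visbias and hidbias
    % are assumed names, not the paper's notation.
    function F = rbm_free_energy(v, W, visbias, hidbias)
        x = v*W + hidbias;                        % total input to each hidden unit
        F = -v*visbias' - sum(log(1 + exp(x)));   % -b*v' - sum_j softplus(x_j)
    end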

To select an action, the algorithm performs Gibbs sampling while holding the state variables fixed. With enough time, this produces the action with the lowest free energy, and thus the highest reward for the given state.
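
Here is a minimal sketch of that action-selection step (my own function and variable names; sigmoid units assumed, with the action bits occupying the last numactiondims visible units, and temperature T controlling how sharply sampling concentrates on low free-energy actions):

    % Sketch: choose an action by Gibbs sampling with the state s clamped.
    % Only the action part of the visible layer is resampled on each sweep.
    function a = select_action_gibbs(s, W, visbias, hidbias, numactiondims, nsteps, T)
        a = rand(1, numactiondims) > 0.5;                    % random initial action
        for step = 1:nsteps
            v = [s a];
            hprob = 1 ./ (1 + exp(-(v*W + hidbias)/T));      % P(h = 1 | s, a)
            h = hprob > rand(size(hprob));
            vprob = 1 ./ (1 + exp(-(h*W' + visbias)/T));     % P(v = 1 | h)
            a = vprob(end-numactiondims+1:end) > rand(1, numactiondims);
        end
    end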

Overview of the large action task:

[image: large action task figure from the paper]

Overview of the author's guidelines for implementation:

A restricted Boltzmann machine with 13 hidden variables was trained on an instantiation of the large action task with a 12-bit state space and a 40-bit action space. Thirteen key states were randomly selected. The network was run for 12 000 actions with a learning rate going from 0.1 to 0.01 and temperature going from 1.0 to 0.1 exponentially over the course of training. Each iteration was initialized with a random state. Each action selection consisted of 100 iterations of Gibbs sampling.
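
The paper gives no explicit formula for those schedules; my reading of them (an assumption on my part) is a per-step geometric interpolation, roughly:

    % Assumed exponential schedules over the 12000 training actions.
    numsteps = 12000;
    t = 0:numsteps-1;
    lr   = 0.1 * (0.01/0.1).^(t/(numsteps-1));   % learning rate: 0.1 -> 0.01
    temp = 1.0 * (0.1/1.0).^(t/(numsteps-1));    % temperature:   1.0 -> 0.1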

Important omitted details:

  • Were bias units needed?
  • Was weight decay needed? And if so, L1 or L2?
  • Was a sparsity constraint needed for the weights and/or activations?
  • Was there any modification of the gradient descent (e.g. momentum)?
  • What meta-parameters were needed for these additional mechanisms?

My implementation:

I initially assumed the authors used no mechanisms other than those described in the guidelines, so I tried training the network without bias units. This led to near-chance performance, and was my first clue that some of the mechanisms used must have been deemed 'obvious' by the authors and thus omitted.

I played around with the various omitted mechanisms mentioned above, and got my best results by using:

  • softmax hidden units
  • momentum of .9 (.5 until the 5th iteration)
  • bias units for the hidden and visible layers
  • a learning rate 1/100th of that listed by the authors
  • L2 weight decay of .0002

But even with all of these modifications, my performance on the task was generally around an average reward of 28 after 12000 iterations.

Code for each iteration:

    %%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    data = [batchdata(:,:,(batch)) rand(1,numactiondims)>.5];   % clamp the current state, start from a random action
    poshidprobs = softmax(data*vishid + hidbiases);              % hidden probabilities for the clamped data

    %%%%%%%%% END OF POSITIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    hidstates = softmax_sample(poshidprobs);   % sample binary hidden states from their probabilities

    %%%%%%%%% START ACTION SELECTION PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    if test
        % greedy action selection at test time (temperature 0)
        [negaction, poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,0);
    else
        % Boltzmann exploration at the current temperature during training
        [negaction, poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,temp);
    end


    data(numdims+1:end) = negaction > rand(numcases,numactiondims);   % sample the binary action from the returned probabilities


    if mod(batch,100) == 1   % periodic diagnostics
        disp(poshidprobs);
        disp(min(~xor(repmat(correct_action(:,(batch)),1,size(key_actions,2)), key_actions(:,:))));
    end

    posprods    = data' * poshidprobs;   % <v_i h_j> statistics for the weight gradient
    poshidact   = poshidprobs;           % statistics for the hidden bias gradient
    posvisact   = data;                  % statistics for the visible bias gradient

    %%%%%%%%% END OF ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


    if batch > 5
        momentum = .9;
    else
        momentum = .5;
    end

    %%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    F = calcF_softmax2(data,vishid,hidbiases,visbiases,temp);   % free energy of the clamped (state, action) pair
    Q = -F;                                                     % negative free energy approximates Q(s,a)
    action = data(numdims+1:end);
    reward = maxreward - sum(abs(correct_action(:,(batch))' - action));   % maxreward minus the number of wrong action bits
    if correct_action(:,(batch)) == correct_action(:,1)   % log separately by which correct action this state maps to
        reward_dataA = [reward_dataA reward];
        Q_A = [Q_A Q];
    else
        reward_dataB = [reward_dataB reward];
        Q_B = [Q_B Q];
    end
    reward_error = sum(reward - Q);   % TD-style error (r - Q) that drives the update
    rewardsum = rewardsum + reward;
    errsum = errsum + abs(reward_error);
    error_data(ind) = reward_error;
    reward_data(ind) = reward;
    Q_data(ind) = Q;

    % gradient step: learning rate * (r - Q) * d(-F)/dtheta, with momentum and L2 weight decay
    vishidinc = momentum*vishidinc + ...
        epsilonw*( (posprods*reward_error)/numcases - weightcost*vishid);
    visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*((posvisact)*reward_error - weightcost*visbiases);
    hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*((poshidact)*reward_error - weightcost*hidbiases);

    vishid = vishid + vishidinc;
    hidbiases = hidbiases + hidbiasinc;
    visbiases = visbiases + visbiasinc;

    %%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
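
calcF_softmax2 is not shown here; as a rough stand-in, this is how I would compute a temperature-scaled free energy for ordinary sigmoid hidden units (the temperature handling is my own assumption, and the softmax-group version differs only in the hidden-unit log-partition term):

    % Rough stand-in for calcF_softmax2, written for sigmoid hidden units.
    % The way temperature enters here is an assumption, not taken from the paper.
    function F = calcF_sigmoid(data, vishid, hidbiases, visbiases, temp)
        x = (data*vishid + hidbiases) / temp;                 % scaled hidden inputs
        F = -data*visbiases' - temp*sum(log(1 + exp(x)));     % free energy at temperature temp
    end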

What I'm asking for:

So, if any of you can get this algorithm to work properly (the authors claim an average reward of ~40 after 12000 iterations), I'd be extremely grateful.

If my code appears to be doing something obviously wrong, then calling attention to that would also constitute a great answer.

I'm hoping that what the authors left out is indeed obvious to someone with more experience in energy-based learning than myself; in that case, simply point out what needs to be included in a working implementation.

  1. The algorithm in the paper looks weird. They use a kind of Hebbian learning that increases connection strength, but there is no mechanism to decay it. In contrast, regular CD pushes the energy of incorrect fantasies up, balancing overall activity. I would speculate that you will need strong sparsity regulation and/or weight decay.
  2. Bias units never hurt :)
  3. Momentum and other fancy stuff may speed things up, but it is usually not necessary.
  4. Why softmax on the hiddens? Shouldn't it just be sigmoid?
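
For point 4, a minimal sketch of what that change would look like in the code above (same variable names as the question):

    % Plain logistic hidden activations instead of softmax in the positive phase.
    poshidprobs = 1 ./ (1 + exp(-(data*vishid + hidbiases)));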
