
Tensorflow: Confusion regarding the adam optimizer

I'm confused as to how the Adam optimizer actually works in TensorFlow.

The way I read the docs, it says that the learning rate is changed every gradient descent iteration.

But when I call the function I give it a learning rate. And I don't call the function to, let's say, do one epoch (implicitly running # iterations to go through my training data). I call the function for each batch explicitly, like

for epoch in epochs:
    for batch in data:
        sess.run(train_adam_step, feed_dict={eta: 1e-3})

So my eta cannot be changing, and I'm not passing a time variable in. Or is this some sort of generator-type thing where, upon session creation, t is incremented each time I call the optimizer?

Assuming it is some generator-type thing and the learning rate is being invisibly reduced: how could I run the Adam optimizer without decaying the learning rate? It seems to me like RMSProp is basically the same; the only thing I'd have to do to make it equal (learning rate disregarded) is to change the hyperparameters momentum and decay to match beta1 and beta2 respectively. Is that correct?

I find the documentation quite clear; I will paste the algorithm here in pseudo-code:

Your parameters:

  • learning_rate: between 1e-4 and 1e-2 is standard
  • beta1: 0.9 by default
  • beta2: 0.999 by default
  • epsilon: 1e-08 by default

    The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
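
For reference, here is a minimal sketch of how these hyperparameters map onto the optimizer's constructor, assuming the TensorFlow 1.x API (tf.train.AdamOptimizer); the values shown are simply the defaults listed above:

import tensorflow as tf  # TF 1.x-style API

# The keyword arguments mirror the hyperparameters listed above.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3,  # between 1e-4 and 1e-2 is standard
                                   beta1=0.9,
                                   beta2=0.999,
                                   epsilon=1e-08)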


Initialization:

m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize initial 2nd moment vector)
t <- 0 (Initialize timestep)

m_t and v_t will keep track of a moving average of the gradient and its square, for each parameter of the network. (So if you have 1M parameters, Adam will keep in memory 2M more parameters.)
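
As a rough illustration (plain NumPy, with a hypothetical 1M-parameter tensor), the extra state is just two arrays of the same shape per parameter:

import numpy as np

params = np.zeros((1000, 1000))   # hypothetical tensor with 1M parameters
m = np.zeros_like(params)         # 1st moment: moving average of the gradient
v = np.zeros_like(params)         # 2nd moment: moving average of the squared gradient
t = 0                             # timestep
# Roughly 2M extra values are now kept in memory alongside the parameters.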


At each iteration t, and for each parameter of the model:

t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * gradient
v_t <- beta2 * v_{t-1} + (1 - beta2) * gradient ** 2
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)

Here lr_t is a bit different from learning_rate, because for early iterations the moving averages have not converged yet, so we have to normalize by multiplying by sqrt(1 - beta2^t) / (1 - beta1^t). When t is high (t > 1./(1.-beta2)), lr_t is almost equal to learning_rate.
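
To make that concrete, here is a small numeric check (plain Python, with the default beta1 and beta2) of how the factor sqrt(1 - beta2^t) / (1 - beta1^t) behaves; the printed values are approximate:

learning_rate, beta1, beta2 = 1e-3, 0.9, 0.999

def lr_t(t):
    # Effective step size from the pseudo-code above.
    return learning_rate * (1 - beta2 ** t) ** 0.5 / (1 - beta1 ** t)

print(lr_t(1))     # ~3.2e-4: early steps are rescaled while m and v warm up
print(lr_t(100))   # ~3.1e-4
print(lr_t(5000))  # ~1.0e-3: t > 1/(1 - beta2) = 1000, so lr_t is close to learning_rate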


To answer your question, you just need to pass a fixed learning rate, keep the beta1 and beta2 default values, maybe modify epsilon, and Adam will do the magic :)
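
For instance, a self-contained toy example (TF 1.x-style API, fitting a single scalar) would look roughly like this; note that the learning rate given to the constructor is a plain constant and no schedule is fed in per step:

import tensorflow as tf  # TF 1.x-style API

# Toy problem: minimize (w - 3)^2 with a fixed learning rate.
w = tf.Variable(0.0)
loss = tf.square(w - 3.0)

optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)  # beta1, beta2, epsilon left at defaults
train_adam_step = optimizer.minimize(loss)              # also creates the m/v slot variables

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())         # initializes w and Adam's internal state
    for step in range(1000):
        sess.run(train_adam_step)                       # no learning-rate value is fed per step
    print(sess.run(w))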


Link with RMSProp

Adam with beta1=0 is equivalent to RMSProp with momentum=0 (up to Adam's bias correction of v). The argument beta2 of Adam and the argument decay of RMSProp are the same.

However, RMSProp does not keep a moving average of the gradient. But it can maintain a momentum, like MomentumOptimizer.

A detailed description of RMSProp:

  • maintain a moving (discounted) average of the square of gradients
  • divide the gradient by the root of this average
  • (can maintain a momentum)

Here is the pseudo-code:

v_t <- decay * v_{t-1} + (1 - decay) * gradient ** 2
mom_t <- momentum * mom_{t-1} + learning_rate * gradient / sqrt(v_t + epsilon)
variable <- variable - mom_t
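
A minimal NumPy sketch of that pseudo-code (the function and argument names here are mine, not TensorFlow's):

import numpy as np

def rmsprop_step(variable, gradient, v, mom,
                 learning_rate=1e-3, decay=0.9, momentum=0.0, epsilon=1e-10):
    # One RMSProp-with-momentum update, mirroring the pseudo-code above.
    v = decay * v + (1 - decay) * gradient ** 2
    mom = momentum * mom + learning_rate * gradient / np.sqrt(v + epsilon)
    variable = variable - mom
    return variable, v, mom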

RMS_PROP and ADAM both have adaptive learning rates.

The basic RMS_PROP:

cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)

You can see that originally this has two parameters: decay_rate and eps.

Then we can add a momentum to make our gradient updates more stable. Then we can write:

cache = decay_rate * cache + (1 - decay_rate) * dx**2
m = beta1*m + (1-beta1)*dx                          # beta1 = momentum parameter in the doc
x += - learning_rate * m / (np.sqrt(cache) + eps)   # the update now uses the momentum term m

Now you can see that if we keep beta1 = 0, then it's rms_prop without the momentum.

Then the basics of ADAM

In cs-231, Andrej Karpathy initially described Adam like this:

Adam is a recently proposed update that looks a bit like RMSProp with momentum

So yes! Then what makes this different from rms_prop with momentum?

m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)

He also mentioned that in the update equations, m and v are smoother.

So the difference from rms_prop is that the update is less noisy.

What causes this noise?

Well, in the initialization procedure we initialize m and v as zero:

m = v = 0

In order to reduce this initialization effect, there is always some warm-up (the bias correction). So then the equations are:

m = beta1*m + (1-beta1)*dx              # beta1 = 0.9, beta2 = 0.999
mt = m / (1 - beta1**t)                 # <-- bias correction (warm-up)
v = beta2*v + (1-beta2)*(dx**2)
vt = v / (1 - beta2**t)                 # <-- bias correction (warm-up)
x += - learning_rate * mt / (np.sqrt(vt) + eps)

Now run this for a few iterations. Pay attention to the bias-correction lines: as t (the iteration number) increases, the following happens to mt:

mt = m, because the correction factor 1/(1 - beta1**t) approaches 1 and the warm-up effect vanishes.
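
A quick numeric check of that correction factor (plain Python, beta1 = 0.9) shows why the warm-up only matters for the first few iterations:

beta1 = 0.9

# mt = m / (1 - beta1**t): the divisor starts far from 1 and quickly approaches 1.
for t in [1, 10, 100]:
    print(t, 1 / (1 - beta1 ** t))
# t=1   -> 10.0  (strong correction right after m is initialized to 0)
# t=10  -> ~1.5
# t=100 -> ~1.0  (mt is essentially just m)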
