
Tensorflow Adam Optimizer

Okay, I have been reading some of the posts regarding AdamOptimizer in TensorFlow. I think there is some confusion around it, at least among beginners in NNs like me.

If I understood correctly, tf.train.AdamOptimizer keeps a so-called "adaptive learning rate". I thought that this learning rate would grow smaller as time increases.

However, when I plot the function by which the learning rate is scaled, taken from the docs,

t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

this is what I get:

import numpy as np
import matplotlib.pyplot as plt

# default betas: beta1 = 0.9, beta2 = 0.999; start at t = 1 to avoid dividing by zero
t = np.arange(1, 200)
result = np.sqrt(1 - 0.999**t) / (1 - 0.9**t)
plt.plot(result)
plt.show()

[plot of np.sqrt(1 - 0.999**t) / (1 - 0.9**t) against t]

So, for t = 1, the user-selected learning rate is multiplied by about 0.3. It then decreases quite fast to about 0.15 of its value, and afterwards increases with time, slowly approaching the limit, i.e. the user-selected learning rate.

Isn't it a bit weird? I guess I am wrong somewhere, but I would've expected the learning rate to start at a higher value and then progressively decrease towards smaller values.

You can not really plot the Adam learning rate like this, since Adam is a momentum optimizer. The applied update for each step depends on moving averages of the mean and of the squared magnitude (second moment) of the gradients from previous steps.
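To make that concrete, here is a minimal NumPy sketch of one Adam update as given in the Adam paper (the function and argument names here are made up for illustration). The factor plotted above is only the bias-correction term; the actual step is additionally divided by the square root of the second-moment estimate:

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving averages of the gradient (first moment) and squared gradient (second moment).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction, which is where the sqrt(1 - beta2^t) / (1 - beta1^t) factor comes from.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Parameter update: the effective step size depends on the gradient history, not just lr.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

This is equivalent (up to where epsilon is applied) to the formulation quoted from the TensorFlow docs, which folds both bias corrections into lr_t and then uses the uncorrected m and v.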

In general there is no guarantee that the learning converges; the raw learning rate alpha itself is not directly changed by Adam. It is only rescaled using the moments of the gradient. The learning only converges well if the mean and standard deviation of the gradient decrease over time when approaching the global minimum, which is often the case for simple neural networks.

For highly stochastic problems, however, one might still need to implement some form of learning rate decay to suppress 'oscillations' around the optimal parameters, or at least make them smaller to make sure there really is convergence.
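In graph-mode TensorFlow 1.x (the API used in the question), one common way to do this is to feed a decayed learning-rate tensor into AdamOptimizer. The snippet below is only a sketch, with a toy loss and placeholder decay settings:

import tensorflow as tf

# Toy variable and loss, just so there is something to minimize.
w = tf.Variable(5.0)
loss = tf.square(w)

global_step = tf.Variable(0, trainable=False)
# Shrink the base learning rate over time; decay_steps and decay_rate are placeholder values.
learning_rate = tf.train.exponential_decay(
    learning_rate=0.001, global_step=global_step,
    decay_steps=1000, decay_rate=0.96, staircase=True)

train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)

AdamOptimizer accepts a tensor as its learning rate, so the decay schedule only shrinks the base step size; Adam's moment-based rescaling is still applied on top of it.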

If you really want to understand how exactly this works, you might want to read the Adam paper; it is much simpler than it seems at first sight.
