How are the past gradients accumulated in Momentum/Adam when we preprocess the gradients before tf.train.Optimizer.apply_gradients()?
I want to preprocess the gradients before apply_gradients, and I want the past gradients to be accumulated upon the processed gradients when tf.train.MomentumOptimizer or tf.train.AdamOptimizer is used. I know we can preprocess the gradients between compute_gradients and apply_gradients, as shown here:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
My question is: in the above case, are the historical gradients accumulated upon the capped gradients or the non-capped ones?
Thanks!
All of the state that optimizers keep is updated in apply_gradients. There is a bit of a complicated call chain (best followed in optimizer.py), but the short of it is that apply_gradients eventually calls one of apply_sparse or apply_dense (ignoring resource variables).
Going back to Adam, apply_sparse is relatively easy to read since it's an agglomeration of ops rather than a single C++ op. You can see that it updates all of the moments and the variables.
So to answer your question: if you cap the gradients before calling apply_gradients, then the capped gradients are what get accumulated in Adam's moments (and likewise for the accumulators of other optimizers such as MomentumOptimizer).
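To see why, here is a minimal plain-Python sketch of the momentum update (not TensorFlow's actual internals; the cap value, learning rate, and `clip` helper are made up for the example). The optimizer's accumulator only ever sees whatever gradient you hand to apply_gradients:

```python
def clip(g, cap):
    # Cap the gradient magnitude, mimicking a "MyCapper"-style preprocessor.
    return max(-cap, min(cap, g))

def momentum_step(velocity, grad, momentum=0.9, lr=0.1):
    # Momentum accumulates whatever gradient it is handed -- it has no
    # access to the raw, pre-clipping gradient.
    velocity = momentum * velocity + grad
    return velocity, -lr * velocity

# Raw gradient is 5.0, but we cap it at 1.0 before "apply_gradients".
raw_grad = 5.0
capped = clip(raw_grad, cap=1.0)

v = 0.0
v, update = momentum_step(v, capped)
# The velocity buffer holds the capped gradient (1.0), not the raw 5.0.
print(v)
```

The same reasoning applies to Adam: its first- and second-moment buffers are updated from the gradient tensor passed in, so they accumulate the capped values.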
There is a bit of a gotcha if you're dealing with sparse gradients (IndexedSlices), since these are disaggregated as they pass through the graph. So if you cap them in the disaggregated form, repeated indices may sum up to more than your cap. This will only be an issue if you're doing gather() or using embeddings, but it's worth keeping in mind.
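A plain-Python sketch of that gotcha (the cap value and the dict-based aggregation are illustrative, not TensorFlow's actual IndexedSlices handling):

```python
def clip(g, cap):
    # Cap one slice's value, as a per-slice preprocessor would.
    return max(-cap, min(cap, g))

# An IndexedSlices-like sparse gradient: the same row index appears twice
# (e.g. the same embedding row was gathered twice in the batch).
indices = [0, 0, 1]
values = [0.9, 0.9, 0.3]

# Clipping in disaggregated form: every slice individually respects the cap.
capped = [clip(v, cap=1.0) for v in values]

# But once repeated indices are summed, row 0 exceeds the 1.0 cap.
dense = {}
for i, v in zip(indices, capped):
    dense[i] = dense.get(i, 0.0) + v

print(dense)  # row 0 sums to 1.8, above the per-slice cap of 1.0
```

If you need the cap to hold per variable row, aggregate the duplicate indices first and clip afterwards.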