
How are the past gradients accumulated in Momentum/Adam when we preprocess the gradients before tf.train.Optimizer.apply_gradients()?

I want to preprocess the gradients before apply_gradients, and I want the past gradients to be accumulated on the processed gradients when tf.train.MomentumOptimizer or tf.train.AdamOptimizer is used. I know we can preprocess the gradients between compute_gradients and apply_gradients, as shown here:

# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
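For concreteness, one plausible implementation of the hypothetical MyCapper above is elementwise value clipping (what tf.clip_by_value does); here is a minimal numpy sketch, not TensorFlow's actual code:

```python
import numpy as np

# A sketch of what the hypothetical MyCapper might do: elementwise
# value clipping, shown with numpy rather than TensorFlow.
def my_capper(grad, cap=1.0):
    """Clip every gradient entry to the interval [-cap, cap]."""
    return np.clip(grad, -cap, cap)

# Entries outside [-1, 1] are clipped; the rest pass through unchanged.
capped = my_capper(np.array([2.0, -3.0, 0.5]))
```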

My question is: in the above case, are the history gradients accumulated on the capped gradients or on the non-capped ones?

Thanks!

All of the state that optimizers keep is updated in apply_gradients. There is a bit of a complicated call chain (best followed in optimizer.py), but the short of it is that apply_gradients eventually calls one of apply_sparse or apply_dense (ignoring resource variables).

Going back to Adam, apply_sparse is relatively easy to read, since it's an agglomeration of ops rather than a single C++ op. You can see that it updates all of the moments and the variables.

So to answer your question: if you cap the gradients before calling apply_gradients, then the capped gradients will be accumulated in Adam's moments (and likewise for other optimizers).
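To make the accumulation explicit, here is a numpy sketch of one Adam step (a simplification, not TensorFlow's implementation): whatever gradient `g` you hand to apply_gradients, capped or not, is exactly what enters the first and second moments.

```python
import numpy as np

# Numpy sketch of one Adam step. `g` is the gradient passed to
# apply_gradients -- if you capped it beforehand, the capped value is
# what the moments m and v accumulate.
def adam_step(param, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # first moment accumulates g
    v = beta2 * v + (1 - beta2) * g * g    # second moment accumulates g^2
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```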

There is a bit of a gotcha if you're dealing with sparse gradients (IndexedSlices), since these are disaggregated as they pass through the graph. So if you cap them in the disaggregated form, repeated indices may sum up to more than your cap. This will only be an issue if you're doing gather() or using embeddings, but it's worth keeping in mind.
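A small numpy illustration of that gotcha (the indices and values are made up; think of them as an IndexedSlices whose index 0 appears twice in disaggregated form):

```python
import numpy as np

# Hypothetical disaggregated sparse gradient: index 0 appears twice.
indices = np.array([0, 0, 1])
values = np.array([0.8, 0.9, 0.5])
cap = 1.0

# Capping in the disaggregated form: every slice obeys the cap...
clipped = np.clip(values, -cap, cap)

# ...but after aggregating repeated indices (what the equivalent dense
# gradient would be), index 0 sums past the cap.
dense = np.zeros(2)
np.add.at(dense, indices, clipped)
# dense[0] is 0.8 + 0.9 = 1.7 > cap, though each slice was within it.
```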
