
Difference between GradientDescentOptimizer and AdamOptimizer in tensorflow?

When using GradientDescentOptimizer instead of AdamOptimizer, the model doesn't seem to converge. On the other hand, AdamOptimizer seems to work fine. Is there something wrong with the GradientDescentOptimizer from tensorflow?

import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np

def randomSample(size=100):
    """
    y = 2 * x -3
    """
    x = np.random.randint(500, size=size)
    y = x * 2  - 3 - np.random.randint(-20, 20, size=size)    

    return x, y

def plotAll(_x, _y, w, b):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(_x, _y)

    x = np.random.randint(500, size=20)
    y = w * x + b
    ax.plot(x, y,'r')
    plt.show()

def lr(_x, _y):

    w = tf.Variable(2, dtype=tf.float32)
    b = tf.Variable(3, dtype=tf.float32)

    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)

    linear_model = w * x + b
    loss = tf.reduce_sum(tf.square(linear_model - y))
    optimizer = tf.train.AdamOptimizer(0.0003) #GradientDescentOptimizer
    train = optimizer.minimize(loss)

    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)
    for i in range(10000):
        sess.run(train, {x : _x, y: _y})
    cw, cb, closs = sess.run([w, b, loss], {x:_x, y:_y})
    print(closs)
    print(cw,cb)

    return cw, cb

x,y = randomSample()
w,b = lr(x,y)
plotAll(x,y, w, b)

I had a similar problem once, and it took me a long time to find the real cause. With gradient descent my loss function was actually growing instead of getting smaller.

It turned out that my learning rate was too high. If you take too big a step with gradient descent you can end up jumping over the minimum, and if you are really unlucky, like I was, you jump so far past it that your error actually increases.
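For a concrete picture (a toy illustration of my own, not from the original post), here is plain gradient descent on f(x) = x², whose gradient is 2x. With a small step the iterate shrinks toward the minimum at 0; with a step above 1 it overshoots further on every update, so the error grows:

def gd(lr, steps=5, x=1.0):
    # one-dimensional gradient descent on f(x) = x^2, gradient is 2x
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(gd(0.1))   # 0.32768  -> moving toward the minimum at 0
print(gd(1.1))   # -2.48832 -> each step lands further from the minimum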

Lowering the learning rate should make the model converge. But it could take a long time.
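As a rough sketch of that fix (the specific learning rate below is my own illustrative guess, not a tuned value): in the lr() function above, the loss is a summed squared error over about 100 points with x values up to 500, so the loss curvature in w scales with the sum of x_i² (roughly 10^7 here), and any step size much above 10^-7 makes plain gradient descent overshoot and diverge. A drop-in change along these lines should stop the divergence, although the bias term will still move very slowly:

    linear_model = w * x + b
    loss = tf.reduce_sum(tf.square(linear_model - y))
    optimizer = tf.train.GradientDescentOptimizer(1e-8)  # illustrative value, not tuned
    train = optimizer.minimize(loss)

Using tf.reduce_mean instead of tf.reduce_sum (or rescaling x) shrinks the gradients and makes the learning rate much easier to choose.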

The Adam optimizer has momentum: it doesn't just follow the instantaneous gradient, it also keeps track of the direction it was going before with a sort of velocity. This way, if the gradient starts sending you back and forth, the momentum forces you to slow down in that direction. This helps a lot! Adam has a few more tweaks besides momentum that make it the preferred deep learning optimizer.
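To make that concrete, here is a small NumPy sketch of the update rule from the Adam paper, simplified to a single parameter with the default hyperparameters. The first running average m is the momentum ("velocity"), and dividing by the square root of the second running average v shrinks the effective step wherever gradients are large, which is what protects Adam from the huge gradients produced by the summed loss in the question:

import numpy as np

def adam_step(grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: running average of gradients ("velocity"); v: running average of squared gradients
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the first iterations
    v_hat = v / (1 - beta2 ** t)
    step = -lr * m_hat / (np.sqrt(v_hat) + eps)   # effective step stays near lr regardless of gradient scale
    return step, m, v

# even a gradient of 1e7 produces a first step of about -lr:
step, m, v = adam_step(1e7, m=0.0, v=0.0, t=1)
print(step)   # roughly -0.001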

If you want to read more about optimizers, this blog post is very informative: http://ruder.io/optimizing-gradient-descent/
