
Difference between GradientDescentOptimizer and AdamOptimizer in tensorflow?

When using GradientDescentOptimizer instead of AdamOptimizer, the model doesn't seem to converge. On the other hand, AdamOptimizer seems to work fine. Is there something wrong with the GradientDescentOptimizer from tensorflow?

import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np

def randomSample(size=100):
    """
    y = 2 * x -3
    """
    x = np.random.randint(500, size=size)
    y = x * 2  - 3 - np.random.randint(-20, 20, size=size)    

    return x, y

def plotAll(_x, _y, w, b):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(_x, _y)

    x = np.random.randint(500, size=20)
    y = w * x + b
    ax.plot(x, y,'r')
    plt.show()

def lr(_x, _y):

    w = tf.Variable(2, dtype=tf.float32)
    b = tf.Variable(3, dtype=tf.float32)

    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)

    linear_model = w * x + b
    loss = tf.reduce_sum(tf.square(linear_model - y))
    optimizer = tf.train.AdamOptimizer(0.0003) #GradientDescentOptimizer
    train = optimizer.minimize(loss)

    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)
    for i in range(10000):
        sess.run(train, {x : _x, y: _y})
    cw, cb, closs = sess.run([w, b, loss], {x:_x, y:_y})
    print(closs)
    print(cw,cb)

    return cw, cb

x,y = randomSample()
w,b = lr(x,y)
plotAll(x,y, w, b)

I had a similar problem once, and it took me a long time to find the real cause. With gradient descent my loss function was actually growing instead of getting smaller.

It turned out that my learning rate was too high. If you take too big a step with gradient descent you can end up jumping over the minimum, and if you are really unlucky, like I was, you jump so far past it that your error actually increases.
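For a concrete picture (a toy illustration of my own, not from the original post), here is plain gradient descent on f(x) = x², whose gradient is 2x. With a small step the iterate shrinks toward the minimum at 0; with a step above 1 it overshoots further on every update, so the error grows:

def gd(lr, steps=5, x=1.0):
    # one-dimensional gradient descent on f(x) = x^2, gradient is 2x
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(gd(0.1))   # 0.32768  -> moving toward the minimum at 0
print(gd(1.1))   # -2.48832 -> each step lands further from the minimum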

Lowering the learning rate should make the model converge. But it could take a long time.
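As a rough sketch of that fix (the specific learning rate below is my own illustrative guess, not a tuned value): in the lr() function above, the loss is a summed squared error over about 100 points with x values up to 500, so the loss curvature in w scales with the sum of x_i² (roughly 10^7 here), and any step size much above 10^-7 makes plain gradient descent overshoot and diverge. A drop-in change along these lines should stop the divergence, although the bias term will still move very slowly:

    linear_model = w * x + b
    loss = tf.reduce_sum(tf.square(linear_model - y))
    optimizer = tf.train.GradientDescentOptimizer(1e-8)  # illustrative value, not tuned
    train = optimizer.minimize(loss)

Using tf.reduce_mean instead of tf.reduce_sum (or rescaling x) shrinks the gradients and makes the learning rate much easier to choose.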

The Adam optimizer has momentum: it doesn't just follow the instantaneous gradient, it also keeps track of the direction it was going before with a sort of velocity. This way, if the gradient starts sending you back and forth, the momentum forces you to slow down in that direction. This helps a lot! Adam has a few more tweaks besides momentum that make it the preferred deep learning optimizer.
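To make that concrete, here is a small NumPy sketch of the update rule from the Adam paper, simplified to a single parameter with the default hyperparameters. The first running average m is the momentum ("velocity"), and dividing by the square root of the second running average v shrinks the effective step wherever gradients are large, which is what protects Adam from the huge gradients produced by the summed loss in the question:

import numpy as np

def adam_step(grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: running average of gradients ("velocity"); v: running average of squared gradients
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the first iterations
    v_hat = v / (1 - beta2 ** t)
    step = -lr * m_hat / (np.sqrt(v_hat) + eps)   # effective step stays near lr regardless of gradient scale
    return step, m, v

# even a gradient of 1e7 produces a first step of about -lr:
step, m, v = adam_step(1e7, m=0.0, v=0.0, t=1)
print(step)   # roughly -0.001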

If you want to read more about optimizers, this blog post is very informative: http://ruder.io/optimizing-gradient-descent/
