
Deep Learning: when learning rate is too high

I observed something really odd in my code when I vary the learning rate of SGD in Keras:

from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Flatten, Dense
from keras.optimizers import SGD


def build_mlp():
    # Two conv + batch-norm blocks followed by a small dense classifier.
    model = Sequential()
    model.add(Conv2D(24, (3, 3), padding='same', activation='relu', input_shape=(28, 28, 1)))
    model.add(BatchNormalization(momentum=0.8))
    model.add(Conv2D(24, (3, 3), padding='same', activation='relu'))
    model.add(BatchNormalization(momentum=0.8))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    model.summary()

    return model


model = build_mlp()
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.0005), metrics=['accuracy'])

During training on the MNIST dataset, I double the learning rate every 5 epochs (a sketch of such a schedule is given at the end of this question). I expected the loss to diverge and oscillate as the learning rate grows. However, I find that after the learning rate increases from about 0.4 to 0.8, the loss and accuracy stop changing altogether. Part of the training log is here:

Epoch, Learning rate, Accuracy, Loss
45,0.05119999870657921,0.67200000166893,5.286721663475037
46,0.05119999870657921,0.44419999949634076,8.957198877334594
47,0.05119999870657921,0.21029999982565642,12.728459935188294
48,0.05119999870657921,0.09939999926835298,14.515956773757935
49,0.05119999870657921,0.09949999924749137,14.514344959259033
50,0.10239999741315842,0.09939999926835298,14.515956773757935
51,0.10239999741315842,0.09979999924078584,14.509509530067444
52,0.10239999741315842,0.10109999923035502,14.488556008338929
53,0.10239999741315842,0.10089999923482537,14.49177963256836
54,0.10239999741315842,0.09979999924078584,14.509509530067444
55,0.20479999482631683,0.09899999927729368,14.522404017448425
56,0.20479999482631683,0.10129999965429307,14.4853324508667
57,0.20479999482631683,0.10119999963790179,14.486944255828858
58,0.20479999482631683,0.10129999965429307,14.4853324508667
59,0.20479999482631683,0.10119999963790179,14.486944255828858
60,0.40959998965263367,0.10129999965429307,14.4853324508667
61,0.40959998965263367,0.10119999963790179,14.486944255828858
62,0.40959998965263367,0.10129999965429307,14.4853324508667
63,0.40959998965263367,0.10139999965205788,14.48372064113617
64,0.40959998965263367,0.09189999906346202,14.636842398643493
65,0.8191999793052673,0.10099999930709601,14.490167903900147
66,0.8191999793052673,0.10099999930709601,14.490167903900147
67,0.8191999793052673,0.10099999930709601,14.490167903900147
68,0.8191999793052673,0.10099999930709601,14.490167903900147
69,0.8191999793052673,0.10099999930709601,14.490167903900147
70,1.6383999586105347,0.10099999930709601,14.490167903900147
71,1.6383999586105347,0.10099999930709601,14.490167903900147
72,1.6383999586105347,0.10099999930709601,14.490167903900147
73,1.6383999586105347,0.10099999930709601,14.490167903900147

As we can see, after epoch 65 the loss is stuck at 14.490167903900147 and does not change anymore. Any idea what causes this phenomenon? Any suggestion is appreciated!
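For reference, a doubling schedule like this can be implemented with a LearningRateScheduler callback roughly as follows (a minimal sketch; my actual training script may differ, and x_train/y_train are placeholders):

from keras.callbacks import LearningRateScheduler

def double_every_5_epochs(epoch, lr):
    # Keep the current learning rate for 5 epochs, then double it.
    if epoch > 0 and epoch % 5 == 0:
        return lr * 2.0
    return lr

# model.fit(x_train, y_train, epochs=75,
#           callbacks=[LearningRateScheduler(double_every_5_epochs, verbose=1)])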

What happens is that your high learning rate has driven the layers' weights to extreme values. That in turn causes the softmax to output values that are either exactly 0 or 1, or very close to those numbers. The network has become "too confident."
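You can see this saturation directly by pushing extreme logits through a softmax; the logits below are made up purely for illustration:

import numpy as np

def softmax(z):
    z = z - z.max()                   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical pre-softmax activations after the weights have blown up:
logits = np.array([3.0, 250.0, -40.0, 1.0, 7.0, 0.0, -5.0, 12.0, 2.0, -1.0])
print(np.round(softmax(logits), 6))   # effectively [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]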

So regardless of input, your network will output 10-dimensional vectors like this:

[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
...

Since the MNIST classes are roughly balanced, such a guess is correct about every tenth time on average, so the accuracy stays near 10%.

To calculate the loss for the network, Keras computes the loss for each sample and then averages over the samples. In this case the loss is the categorical crossentropy, which for one-hot targets is equivalent to taking the negative log of the probability the model assigns to the target label.
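In NumPy terms, the per-sample computation is roughly the following (a sketch of what Keras' categorical_crossentropy does with one-hot targets, assuming the default backend epsilon of 1e-7):

import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip the predictions away from 0 and 1 before taking the log.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.eye(10)[3]                                  # true class is 3
print(categorical_crossentropy(y_true, np.eye(10)[3]))  # ~0: target probability is ~1
print(categorical_crossentropy(y_true, np.eye(10)[1]))  # ~16.12: target probability is ~0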

If the probability of the target label is 1, the negative log is 0:

-np.log(1.0) = 0.0

But what if it is 0? The log of 0 isn't defined, so Keras clips the predicted probabilities to a small epsilon (1e-7 by default) before taking the log:

-np.log(0.0000001) = 16.11809565095832

So for 9 out of 10 samples the loss is 16.11809565095832 and for 1 out of 10 it is 0. Thus on average:

16.11809565095832 * 0.9 = 14.506286085862488
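Plugging in the actual accuracy from the log (about 0.101 rather than exactly 0.1) reproduces the stuck loss value almost exactly:

import numpy as np

acc = 0.10099999930709601     # accuracy reported from epoch 65 onwards
wrong = -np.log(1e-7)         # loss for a sample whose target class got probability ~0
print((1.0 - acc) * wrong)    # ~14.4902, vs. the observed 14.490167903900147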
