
With neural networks, should the learning rate be in some way proportional to hidden layer sizes? Should they affect each other?

My neural network is a normal feed-forward network trained with backpropagation. It has 10 outputs, which should form a vector where one output is 1 and the rest are 0, e.g. [0,0,0,0,1,0,0,0,0,0]. So an output I would expect looks something like this:

[0.21332215, 0.13782996, 0.13548511, 0.09321094, 0.16769843, 0.20333131, 0.06613014, 0.10699013, 0.10622562, 0.09809167]

and ideally once trained, this:

[0.21332215, 0.13782996, 0.13548511, 0.09321094, 0.96769843, 0.20333131, 0.06613014, 0.10699013, 0.10622562, 0.09809167]

With 30 neurons in the hidden layer and a learning rate between 0.1 and 1, I get results like these. However, with 100 neurons in the hidden layer and a learning rate of 0.01, I get results like this:

[1.75289110e-05, 1.16433042e-04, 2.83848791e-01, 4.47291309e-02, 1.63011592e-01, 8.12974408e-05, 1.06284533e-03, 2.95174797e-02, 7.54112632e-05, 1.33177529e-03]

Why is this? Is this what over-learning looks like?

Then, when I change the learning rate to 0.0001 with 100 neurons in the hidden layer, I get normal results again.

So my question is: how should the hidden layer size affect the learning rate? Should bigger hidden layers mean lower learning rates?
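For reference, here is a minimal NumPy sketch of the setup described above. It is a reconstruction, not my actual code: the sigmoid activations, squared-error loss, input size, and full-batch gradient descent are all assumptions.

```python
# Minimal sketch of the setup in the question. Assumptions (not stated above):
# sigmoid activations, squared-error loss, full-batch gradient descent,
# 64 random input features per class.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(hidden, lr, epochs=2000, n_in=64, n_out=10):
    X = rng.standard_normal((n_out, n_in))  # one random input per class
    T = np.eye(n_out)                       # one-hot targets
    W1 = rng.standard_normal((n_in, hidden)) * 0.1
    W2 = rng.standard_normal((hidden, n_out)) * 0.1
    for _ in range(epochs):
        H = sigmoid(X @ W1)                 # hidden activations
        Y = sigmoid(H @ W2)                 # output activations
        # Backprop for squared error through sigmoid units.
        d_out = (Y - T) * Y * (1.0 - Y)
        d_hid = (d_out @ W2.T) * H * (1.0 - H)
        W2 -= lr * (H.T @ d_out)
        W1 -= lr * (X.T @ d_hid)
    return Y

# One row of the trained outputs under the three regimes from the question;
# the wide layer with the large rate tends to saturate rather than converge.
print(np.round(train(hidden=30, lr=0.5)[4], 3))
print(np.round(train(hidden=100, lr=0.5)[4], 3))
print(np.round(train(hidden=100, lr=0.01)[4], 3))
```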

There is a relation between the hidden unit count and the learning rate. In general, increasing the hidden unit count gives you a more heavily parametrised model with higher capacity, and such a model is more prone to overfitting on the same training set. In addition, the model operates in a higher-dimensional parameter space and has a more complex error surface than a thinner model.

When you apply a large learning rate in such a complex error regime, the SGD process can easily diverge to meaningless locations, which, I believe, is the real reason you are getting those weird results with the higher learning rate. Intuitively, with more hidden units feeding each output, a fixed per-weight step changes each output's pre-activation by a larger total amount, so the same learning rate effectively takes bigger steps. In short, it is logical that smaller learning rates work better when the model is more complex.
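As a rough illustration (the scaling rules below are common heuristics, not something established by your experiment), you can either shrink the learning rate roughly in proportion to the layer width, or scale the initial weights by 1/sqrt(fan_in) so that each unit's pre-activation stays around unit size regardless of width:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out):
    # Scaling by 1/sqrt(fan_in) keeps each unit's pre-activation at roughly
    # unit variance whatever the layer width, so one learning rate behaves
    # similarly across hidden sizes.
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

def scaled_lr(base_lr, base_hidden, hidden):
    # Heuristic (an assumption, not a law): shrink the step size as the
    # hidden layer grows, e.g. a rate that worked at 30 units, rescaled
    # for 100 units.
    return base_lr * base_hidden / hidden

print(scaled_lr(base_lr=0.5, base_hidden=30, hidden=100))  # 0.15
```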
