
Choosing starting parameters for the Levenberg-Marquardt Algorithm on a Neural Net

I'm currently working on a project in which an ANN is being used. For the training algorithm, I selected LMA as it is fairly fast and versatile, and I read a paper suggesting it is the best training algorithm for our use case. After writing it, however, I became concerned, as the SSE (the sum of the squared errors divided by 2) was only being reduced from 2.05 to 1.00 on a simple XOR problem using a network with 2 inputs, 1 hidden layer with 2 nodes, and 1 output. I thought I had made a mistake somewhere in programming it, but when I tried changing the PRNG seed value, the SSE suddenly converged to 2.63e-09. In a way this was even more disconcerting than a possible programming error, as I wouldn't expect the performance of the algorithm to be affected this much by random chance on such a simple problem.
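For context, LMA's update is the damped Gauss-Newton step. Below is a minimal numpy sketch of a single iteration, not my actual code; `residuals` and `jacobian` are hypothetical stand-ins for whatever computes the error vector (targets minus outputs) and the Jacobian of the network outputs with respect to the weights:

```python
import numpy as np

def lm_step(w, residuals, jacobian, lam):
    """One Levenberg-Marquardt iteration on the flattened weight vector w."""
    e = residuals(w)                       # error vector, targets - outputs, shape (n_samples,)
    J = jacobian(w)                        # d(outputs)/d(weights), shape (n_samples, n_params)
    sse = 0.5 * e @ e                      # the SSE used above: sum of squared errors / 2
    H = J.T @ J + lam * np.eye(w.size)     # damped Gauss-Newton approximation of the Hessian
    step = np.linalg.solve(H, J.T @ e)     # solve (J^T J + lam*I) step = J^T e
    return w + step, sse
```

In the usual scheme the damping factor lam is increased when a step makes the SSE worse and decreased when it improves, so the algorithm interpolates between gradient descent and Gauss-Newton.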

The PRNG generates the biases and weights according to a bimodal distribution with modes at 0.8 and -0.8, whose density drops close to 0 around 0, so hopefully I'm not harming the algorithm from the start with very small parameters. Are there any other tips for generating good starting values? I'm using tanh as my sigmoid function, if that makes a difference. I'm thinking that values with a larger magnitude might make a difference, but I'm equally concerned that could have detrimental effects as well.
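For concreteness, the initialisation I'm describing looks roughly like this (a sketch only; the spread around the two modes is an assumed value):

```python
import numpy as np

rng = np.random.default_rng()

def bimodal_init(n, modes=(-0.8, 0.8), spread=0.2):
    """Draw n parameters from a two-peaked distribution centred on -0.8 and 0.8,
    so that values near zero are unlikely (spread=0.2 is an assumed value)."""
    centres = rng.choice(np.asarray(modes), size=n)
    return centres + spread * rng.standard_normal(n)

weights = bimodal_init(9)   # a [2,2,1] network with biases has 9 parameters
```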

I know that LMA only converges to a local minimum, but surely with how ubiquitously it is used there is some way to avoid these problems. Was I simply unlucky with my seed value? Should I simply repeat the training with a new seed value every time it gets stuck? Should I look towards another training algorithm entirely?

The ANN will first be pretrained on some historical data and then updated regularly with more recent data, so although I can probably afford to repeat the training a few times if necessary, there's a practical limit to how many seed values can be tried. Also, although this initial test only had 9 parameters, we will eventually be dealing with close to 10,000, and perhaps more than one hidden layer. My instinct is that this will worsen the problem with local minima, but is it possible that an increased problem size could actually be beneficial?

TL;DR

The problem was that my network configuration was too small to handle the complexity of XOR. Switching to a [2,3,1] configuration brought immediate improvements, and [2,4,1] was better still. Other logic tables didn't need as large a network.


Progress!

Ok, I think I've made some progress and found the source of the problem. I trained a set of 100 random XOR networks using layer sizes of [2,2,1], and then plotted a reverse cumulative graph of the number of networks which reached a given SSE after up to 1000 epochs (stopping early after the SSE drops below 1e-8).
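In case anyone wants to reproduce this, the experiment boils down to the following (a sketch: `train_xor` is a hypothetical stand-in for my training routine, returning the final SSE of a run, or NaN if it blew up):

```python
import numpy as np
import matplotlib.pyplot as plt

def reverse_cumulative_plot(final_sses):
    """Plot the fraction of networks whose final SSE is at or below each value."""
    sses = np.asarray(final_sses, dtype=float)
    valid = np.sort(sses[np.isfinite(sses)])               # drop NaN-corrupted runs
    fraction = np.arange(1, valid.size + 1) / valid.size   # cumulative fraction per threshold
    plt.semilogx(valid, fraction)
    plt.xlabel("final SSE after up to 1000 epochs")
    plt.ylabel("fraction of networks at or below this SSE")
    plt.show()

# final_sses = [train_xor(seed=s, layers=[2, 2, 1], max_epochs=1000, target_sse=1e-8)
#               for s in range(100)]
# reverse_cumulative_plot(final_sses)
```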

[figure: a XOR b]

This was the graph I obtained for (a XOR b). 8% of the networks were corrupted by NaN values (I assume it has something to do with the decomposition and matrix library I'm using, but I digress). What's worrying, though, is that of the 92 valid sample networks, only 43% reached an SSE lower than ~1. Larger sample sizes tended to produce even worse results: IIRC, for a sample size of 1000 (with a lower number of epochs) only 4% went below 1, while a more recent test, again using 1000 epochs but with a sample size of 1000, produced a more respectable 47%. Nevertheless, this was unacceptable to me, and very frustrating, as those that did make it below 1 tended to do very well, typically reaching at least 1e-6 or better.

Anyway, we recently wrote some Python bindings and implemented several more test networks, expecting to see similar results. Surprisingly, however, these tests worked almost perfectly, with over 90% tending to do better than 1e-6:

a AND b

[2,2,1] layer configuration. [figure: a AND b]

a OR b

[2,2,1] layer configuration. [figure: a OR b]

a OR (b AND c)

[3,4,1] layer configuration (3 inputs: a, b, c). [figure: a OR (b AND c)]


Clearly, something was wrong with the XOR network in particular, and I simply had the misfortune of picking XOR as my first test problem. Reading other questions on SO, it seems that XOR isn't modelled well by small networks, and is impossible on a [2,2,1] network without biases. I had biases, but clearly that wasn't enough. With these clues, I was finally able to bring the XOR network into line with the other problems. By simply adding one more hidden node and using a [2,3,1] layer configuration, I raised the proportion of samples hitting 1e-6 to over 70%:

[figure: a XOR b, [2,3,1] configuration]

Using [2,4,1] raised it to 85%:

[figure: a XOR b, [2,4,1] configuration]

Clearly my problem was that my network just wasn't large enough to handle the complexity of the XOR problem, and I suggest that anyone testing their neural network on a 2-bit XOR problem keep this in mind!

Thanks for bearing with me through this long post, and apologies for the excessive use of images. I hope this saves people in a similar situation a lot of headaches!

Extra information related to the question

During my investigation, I learnt quite a bit, and so I'd like to address some more points about using LMA that may be of interest.

First, the choice of distribution seems to make no difference, as long as the initial weights are random. I tried the bimodal distribution mentioned in the question, a uniform distribution between 0 and 1, a Gaussian, a Gaussian with mean 0.5 and SD 0.5, and even a triangular distribution, and they all gave very similar results. I'm sticking with the bimodal one, however, as it seems the most natural to me.
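For reference, the kinds of initialisers I compared were along these lines (a sketch; the exact parameters of each distribution, beyond what I stated above, are assumptions):

```python
import numpy as np

rng = np.random.default_rng()

# Candidate initialisers; each takes the number of parameters n and returns n samples.
initialisers = {
    "bimodal, modes at +/-0.8":  lambda n: rng.choice([-0.8, 0.8], size=n)
                                            + 0.2 * rng.standard_normal(n),
    "uniform on [0, 1]":         lambda n: rng.uniform(0.0, 1.0, size=n),
    "gaussian":                  lambda n: rng.standard_normal(n),
    "gaussian, mean 0.5 SD 0.5": lambda n: rng.normal(0.5, 0.5, size=n),
    "triangular on [-1, 1]":     lambda n: rng.triangular(-1.0, 0.0, 1.0, size=n),
}

samples = {name: init(9) for name, init in initialisers.items()}
```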

Secondly, it is clear that even for the simple problems I had here, repeated training is necessary. Whilst ~90% of the samples produced a decent SSE, the remaining ~10% show that you should always be prepared to repeat training with a new set of random weights, either until you reach your desired SSE or for a fixed number of restarts, keeping the best result out of your sample.
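In code, that retraining policy amounts to something like the following (a sketch reusing the same hypothetical `train_xor` helper as above):

```python
import numpy as np

def train_with_restarts(train_fn, layers, target_sse=1e-6, max_restarts=10):
    """Retrain with a fresh seed until the target SSE is reached or the restart
    budget runs out, keeping the best run seen so far."""
    best_seed, best_sse = None, np.inf
    for seed in range(max_restarts):
        sse = train_fn(seed=seed, layers=layers)
        if np.isfinite(sse) and sse < best_sse:   # ignore NaN-corrupted runs
            best_seed, best_sse = seed, sse
        if best_sse <= target_sse:                # good enough, stop early
            break
    return best_seed, best_sse

# best_seed, best_sse = train_with_restarts(train_xor, layers=[2, 3, 1])
```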

Finally, my tests have led me to believe that LMA is indeed as effective and versatile as claimed, and I'm much more confident about using it now. I still need to test it on larger problems (I'm considering MNIST), but I'm hopeful that it will remain just as effective at that scale.
