
Neural network backprop not fully training

I have this neural network, seen below, that I've trained. It works, or at least appears to work, but the problem is with the training. I'm trying to train it to act as an OR gate, but it never seems to get there; the output tends to look like this:

prior to training:

 [[0.50181624]
 [0.50183743]
 [0.50180414]
 [0.50182533]]

post training:

 [[0.69641759]
 [0.754652  ]
 [0.75447178]
 [0.79431198]]

expected output:

 [[0]
 [1]
 [1]
 [1]]

I have this loss graph:

[loss graph image]

It's strange: it appears to be training, but at the same time it's not quite getting to the expected output. I know it would never really reach exact 0s and 1s, but I expect it to manage something a little closer to the expected output.

I had some issues figuring out how to backpropagate the error, since I wanted this network to support any number of hidden layers, so I stored the local gradient in each layer, alongside the weights, and sent the error back from the end.

The main functions I suspect of being the culprits are NeuralNetwork.train and both forward methods.

import sys
import math
import numpy as np
import matplotlib.pyplot as plt
from itertools import product


class NeuralNetwork:
    class __Layer:
        def __init__(self,args):
            self.__epsilon = 1e-6
            self.localGrad = 0
            self.__weights = np.random.randn(
                args["previousLayerHeight"],
                args["height"]
            )*0.01
            self.__biases = np.zeros(
                (args["biasHeight"],1)
            )

        def __str__(self):
            return str(self.__weights)

        def forward(self,X):
            a = np.dot(X, self.__weights) + self.__biases
            self.localGrad = np.dot(X.T,self.__sigmoidPrime(a))
            return self.__sigmoid(a)

        def adjustWeights(self, err):
            self.__weights -= (err * self.__epsilon)

        def __sigmoid(self, z):
            return 1/(1 + np.exp(-z))

        def __sigmoidPrime(self, a):
            return self.__sigmoid(a)*(1 - self.__sigmoid(a))

    def __init__(self,args):
        self.__inputDimensions = args["inputDimensions"]
        self.__outputDimensions = args["outputDimensions"]
        self.__hiddenDimensions = args["hiddenDimensions"]
        self.__layers = []
        self.__constructLayers()

    def __constructLayers(self):
        self.__layers.append(
            self.__Layer(
                {
                    "biasHeight": self.__inputDimensions[0],
                    "previousLayerHeight": self.__inputDimensions[1],
                    "height": self.__hiddenDimensions[0][0] 
                        if len(self.__hiddenDimensions) > 0 
                        else self.__outputDimensions[0]
                }
            )
        )

        for i in range(len(self.__hiddenDimensions)):
            self.__layers.append(
                self.__Layer(
                    {
                        "biasHeight": self.__hiddenDimensions[i + 1][0] 
                            if i + 1 < len(self.__hiddenDimensions)
                            else self.__outputDimensions[0],
                        "previousLayerHeight": self.__hiddenDimensions[i][0],
                        "height": self.__hiddenDimensions[i + 1][0] 
                            if i + 1 < len(self.__hiddenDimensions)
                            else self.__outputDimensions[0]
                    }
                )
            )

    def forward(self,X):
        out = self.__layers[0].forward(X)
        for i in range(len(self.__layers) - 1):
            out = self.__layers[i+1].forward(out)
        return out  

    def train(self,X,Y,loss,epoch=5000000):
        for i in range(epoch):
            YHat = self.forward(X)
            delta = -(Y-YHat)
            loss.append(sum(Y-YHat))
            err = np.sum(np.dot(self.__layers[-1].localGrad,delta.T), axis=1)
            err.shape = (self.__hiddenDimensions[-1][0],1)
            self.__layers[-1].adjustWeights(err)
            i=0
            for l in reversed(self.__layers[:-1]):
                err = np.dot(l.localGrad, err)
                l.adjustWeights(err)
                i += 1

    def printLayers(self):
        print("Layers:\n")
        for l in self.__layers:
            print(l)
            print("\n")

def main(args):
    X = np.array([[x,y] for x,y in product([0,1],repeat=2)])
    Y = np.array([[0],[1],[1],[1]])
    nn = NeuralNetwork(
        {
            #(height,width)
            "inputDimensions": (4,2),
            "outputDimensions": (1,1),
            "hiddenDimensions":[
                (6,1)
            ]
        }
    )

    print("input:\n\n",X,"\n")
    print("expected output:\n\n",Y,"\n")
    nn.printLayers()
    print("prior to training:\n\n",nn.forward(X), "\n")
    loss = []
    nn.train(X,Y,loss)
    print("post training:\n\n",nn.forward(X), "\n")
    nn.printLayers()
    fig,ax = plt.subplots()

    x = np.array([x for x in range(5000000)])
    loss = np.array(loss)
    ax.plot(x,loss)
    ax.set(xlabel="epoch",ylabel="loss",title="logic gate training")

    plt.show()

if(__name__=="__main__"):
    main(sys.argv[1:])

Could someone please point out what I'm doing wrong here? I strongly suspect it has to do with the way I'm dealing with matrices, but at the same time I don't have the slightest idea what's going on.

Thanks for taking the time to read my question, and taking the time to respond (if relevant).

Edit: Actually, quite a lot is wrong with this, but I'm still a bit confused about how to fix it. Although the loss graph looks like it's training, and it kind of is, the math I've done above is wrong.

Look at the training function.

def train(self,X,Y,loss,epoch=5000000):
        for i in range(epoch):
            YHat = self.forward(X)
            delta = -(Y-YHat)
            loss.append(sum(Y-YHat))
            err = np.sum(np.dot(self.__layers[-1].localGrad,delta.T), axis=1)
            err.shape = (self.__hiddenDimensions[-1][0],1)
            self.__layers[-1].adjustWeights(err)
            i=0
            for l in reversed(self.__layers[:-1]):
                err = np.dot(l.localGrad, err)
                l.adjustWeights(err)
                i += 1

Note how I compute delta = -(Y-Yhat) and then take its dot product with the "local gradient" of the last layer. The "local gradient" here is the local W gradient.

def forward(self,X):
    a = np.dot(X, self.__weights) + self.__biases
    self.localGrad = np.dot(X.T,self.__sigmoidPrime(a))
    return self.__sigmoid(a)

I'm skipping a step in the chain rule. I should really be multiplying by W * sigprime(XW + b) first, as that's the local gradient of X, and then by the local W gradient. I tried that, but I'm still getting issues. Here is the new forward method (note that the layers' __init__ needs to initialise the new variables, and I changed the activation function to tanh):

def forward(self, X):
    a = np.dot(X, self.__weights) + self.__biases
    self.localPartialGrad = self.__tanhPrime(a)
    self.localWGrad = np.dot(X.T, self.localPartialGrad)
    self.localXGrad = np.dot(self.localPartialGrad,self.__weights.T)            
    return self.__tanh(a)
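
To spell out what the three stored quantities are (my reading of the method above, checked with a tiny standalone sketch; the sizes below are just illustrative): localPartialGrad is f'(a), localWGrad is X.T dotted with f'(a) and has the same shape as the weights, and localXGrad is f'(a) dotted with W.T and has the same shape as the input.

import numpy as np

# Tiny standalone shape check (illustrative sizes: batch of 4, 2 inputs, 6 units)
rng = np.random.RandomState(0)
X = rng.randn(4, 2)
W = rng.randn(2, 6) * 0.01
b = np.zeros((4, 1))

a = X.dot(W) + b
localPartialGrad = 1 - np.tanh(a) ** 2     # f'(a), shape (4, 6)
localWGrad = X.T.dot(localPartialGrad)     # shape (2, 6), same as W
localXGrad = localPartialGrad.dot(W.T)     # shape (4, 2), same as X

print(localPartialGrad.shape, localWGrad.shape, localXGrad.shape)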

I also updated the training method to look something like this:

def train(self, X, Y, loss, epoch=5000):
    for e in range(epoch):
        Yhat = self.forward(X)
        err = -(Y-Yhat)
        loss.append(sum(err))
        print("loss:\n",sum(err))
        for l in self.__layers[::-1]:
            l.adjustWeights(err)
            if(l != self.__layers[0]):
                err = np.multiply(err,l.localPartialGrad)
                err = np.multiply(err,l.localXGrad)

The new graphs I'm getting are all over the place; I have no idea what's going on. Here is the final bit of code I changed:

def adjustWeights(self, err):
    perr = np.multiply(err, self.localPartialGrad)  
    werr = np.sum(np.dot(self.__weights,perr.T),axis=1)
    werr = werr * self.__epsilon
    werr.shape = (self.__weights.shape[0],1)
    self.__weights = self.__weights - werr
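
For reference, here is a minimal standalone numerical gradient check for a tiny two-layer tanh network of the same shape (names and sizes are illustrative and independent of my classes), which the analytic gradients can be compared against:

import numpy as np

# Finite-difference gradient check for a tiny 2-layer tanh network (no biases)
rng = np.random.RandomState(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [1]], dtype=float)
W1 = rng.randn(2, 6) * 0.01
W2 = rng.randn(6, 1) * 0.01

def loss(W1, W2):
    X1 = np.tanh(X.dot(W1))
    X2 = np.tanh(X1.dot(W2))
    return 0.5 * np.sum((Y - X2) ** 2)

# Analytic gradients via the chain rule
a1 = X.dot(W1); X1 = np.tanh(a1)
a2 = X1.dot(W2); X2 = np.tanh(a2)
delta2 = -(Y - X2) * (1 - X2 ** 2)          # dL/da2
dW2 = X1.T.dot(delta2)                      # dL/dW2
delta1 = delta2.dot(W2.T) * (1 - X1 ** 2)   # dL/da1
dW1 = X.T.dot(delta1)                       # dL/dW1

# Numerical gradient for one entry of W1, compared to the analytic value
h = 1e-5
W1p = W1.copy(); W1p[0, 0] += h
W1m = W1.copy(); W1m[0, 0] -= h
numerical = (loss(W1p, W2) - loss(W1m, W2)) / (2 * h)
print(numerical, dW1[0, 0])   # the two numbers should closely agree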

Your network is learning, as can be seen from the loss chart, so the backprop implementation is correct (congrats!). The main problem with this particular architecture is the choice of activation function: sigmoid. I replaced sigmoid with tanh and it works much better instantly.

From this discussion on CV.SE:

There are two reasons for that choice (assuming you have normalized your data, and this is very important):

  • Having stronger gradients: since the data is centered around 0, the derivatives are higher. To see this, calculate the derivative of the tanh function and notice that its range (output values) is [0,1]. The range of the tanh function is [-1,1], while that of the sigmoid function is [0,1].

  • Avoiding bias in the gradients. This is explained very well in the paper, and it is worth reading it to understand these issues.

Though I'm sure a sigmoid-based NN can be trained as well, it looks like it's much more sensitive to input values (note that they are not zero-centered), because the activation itself is not zero-centered. tanh is better than sigmoid by all means here, so the simpler approach is to just use that activation function.
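
As a quick illustration of the "stronger gradients" point (a throwaway sketch, not part of the network code): the derivative of the sigmoid never exceeds 0.25, while the derivative of tanh goes up to 1, so every backward pass through a sigmoid layer scales the error signal by at most a quarter.

import numpy as np

# Peak slope of the two activations over a range of inputs
z = np.linspace(-5, 5, 1001)
sig = 1 / (1 + np.exp(-z))
sig_prime = sig * (1 - sig)        # sigmoid derivative, peaks at 0.25 (at z = 0)
tanh_prime = 1 - np.tanh(z) ** 2   # tanh derivative, peaks at 1.0 (at z = 0)
print(sig_prime.max(), tanh_prime.max())   # ~0.25 vs ~1.0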

The key change is this:

def __tanh(self, z):
  return np.tanh(z)

def __tanhPrime(self, a):
  return 1 - self.__tanh(a) ** 2

... instead of __sigmoid and __sigmoidPrime.

I have also tuned hyperparameters a little bit, so that the network now learns in 100k epochs, instead of 5m:

prior to training:

 [[ 0.        ]
 [-0.00056925]
 [-0.00044885]
 [-0.00101794]] 

post training:

 [[0.        ]
 [0.97335842]
 [0.97340917]
 [0.98332273]] 

[training plot]

The complete code is in this gist.

Well I'm an idiot. I was right about being wrong but I was wrong about how wrong I was. Let me explain.

Within the backwards training method, I got the last layer trained correctly, but all the layers after that weren't trained correctly, which is why the above network was coming up with a result: it was indeed training, but only one layer.

So what did I do wrong? Well, I was only multiplying by the local gradient of the weights with respect to the output, so the chain rule was only partially correct.

Let's say the loss function was this:

t = Y-X2

loss = 1/2*(t)^2

a2 = X1W2 + b

X2 = activation(a2)

a1 = X0W1 + b

X1 = activation(a1)

We know that the derivative of the loss with respect to W2 would be -(Y-X2)*X1. This was done in the first part of my training function:

def train(self,X,Y,loss,epoch=5000000):
    for i in range(epoch):
        #First part
        YHat = self.forward(X)
        delta = -(Y-YHat)
        loss.append(sum(Y-YHat))
        err = np.sum(np.dot(self.__layers[-1].localGrad,delta.T), axis=1)
        err.shape = (self.__hiddenDimensions[-1][0],1)
        self.__layers[-1].adjustWeights(err)
        i=0
        #Second part
        for l in reversed(self.__layers[:-1]):
            err = np.dot(l.localGrad, err)
            l.adjustWeights(err)
            i += 1

However, the second part is where I screwed up. In order to calculate the gradient of the loss with respect to W1, I must multiply the original error -(Y-X2) by W2, since W2 is the local X gradient of the last layer, and because of the chain rule this must be done first. Then I can multiply by the local W gradient (X1) to get the gradient of the loss with respect to W1. I failed to do the multiplication by the local X gradient first, so the last layer was indeed training, but all the layers after that had an error that was magnified as the depth increased.
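
Written out in full for the two-layer example above (my own spelling-out of the chain rule, where f is the activation, f' its derivative, and the products involving f' terms are elementwise):

$$\delta_2 = -(Y - X_2)\, f'(a_2), \qquad \frac{\partial\,\mathrm{loss}}{\partial W_2} = X_1^{\top}\, \delta_2$$

$$\delta_1 = \big(\delta_2\, W_2^{\top}\big)\, f'(a_1), \qquad \frac{\partial\,\mathrm{loss}}{\partial W_1} = X_0^{\top}\, \delta_1$$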

To solve this I updated the train method:

def train(self,X,Y,loss,epoch=10000):
    for i in range(epoch):
        YHat = self.forward(X)
        err = -(Y-YHat)
        loss.append(sum(Y-YHat))
        werr = np.sum(np.dot(self.__layers[-1].localWGrad,err.T), axis=1)
        werr.shape = (self.__hiddenDimensions[-1][0],1)
        self.__layers[-1].adjustWeights(werr)
        for l in reversed(self.__layers[:-1]):
            err = np.multiply(err, l.localXGrad)
            werr = np.sum(np.dot(l.weights,err.T),axis=1)
            l.adjustWeights(werr)

Now the loss graph I got looks like this:

[new loss graph image]
