
ReLU performing worse than sigmoid?

I use sigmoid on all the layers and the output and get a final error rate of 0.00012, but when I use ReLU, which is theoretically better, I get the worst results possible. Can anyone explain why this is happening? I am using a very simple 2-layer implementation available on hundreds of websites, but I am still including it below:

import numpy as np
#test
#avg(nonlin(np.dot(nonlin(np.dot([0,0,1],syn0)),syn1)))
#returns list >> [predicted_output, confidence]
def nonlin(x,deriv=False):#Sigmoid
    if(deriv==True):
        return x*(1-x)

    return 1/(1+np.exp(-x))

def relu(x, deriv=False):#RELU
    if (deriv == True):
        for i in range(0, len(x)):
            for k in range(len(x[i])):
                if x[i][k] > 0:
                    x[i][k] = 1
                else:
                    x[i][k] = 0
        return x
    for i in range(0, len(x)):
        for k in range(0, len(x[i])):
            if x[i][k] > 0:
                pass  # do nothing since it would be effectively replacing x with x
            else:
                x[i][k] = 0
    return x

X = np.array([[0,0,1],
            [0,0,0],  
            [0,1,1],
            [1,0,1],
            [1,0,0],
            [0,1,0]])

y = np.array([[0],[1],[0],[0],[1],[1]])

np.random.seed(1)

# randomly initialize our weights with mean 0
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1

def avg(i):
        if i > 0.5:
            confidence = i
            return [1,float(confidence)]
        else:
            confidence=1.0-float(i)
            return [0,confidence]
for j in xrange(500000):

    # Feed forward through layers 0, 1, and 2
    l0 = X
    l1 = nonlin(np.dot(l0,syn0))
    l2 = nonlin(np.dot(l1,syn1))
    #print 'this is',l2,'\n'
    # how much did we miss the target value?
    l2_error = y - l2
    #print l2_error,'\n'
    if (j% 100000) == 0:
        print "Error:" + str(np.mean(np.abs(l2_error)))
        print syn1

    # in what direction is the target value?
    # were we really sure? if so, don't change too much.
    l2_delta = l2_error*nonlin(l2,deriv=True)

    # how much did each l1 value contribute to the l2 error (according to the weights)?
    l1_error = l2_delta.dot(syn1.T)

    # in what direction is the target l1?
    # were we really sure? if so, don't change too much.
    l1_delta = l1_error * nonlin(l1,deriv=True)

    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)
print "Final Error:" + str(np.mean(np.abs(l2_error)))
def p(l):
        return avg(nonlin(np.dot(nonlin(np.dot(l,syn0)),syn1)))

So p(x) is the prediction function after training, where x is a 1 x 3 matrix of input values.
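For example, after training it can be called like this (the exact confidence will depend on the trained weights):

print p([1,0,0])  # [1,0,0] is labelled 1 in y, so this should return a list like [1, <confidence close to 1>]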

Why do you say it's theoretically better? In the majority of applications ReLU has proven better, but that doesn't mean it is universally better. Your example is very simple and the input is scaled to [0,1], same as the output. That's exactly where I would expect sigmoid to perform well. In practice you rarely see sigmoids in hidden layers, because of the vanishing gradient problem and some other issues with large networks, but that's hardly a concern for a network this small.
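To illustrate the vanishing-gradient point, here is a small sketch (not from your code) comparing the two derivatives for a few pre-activation values: the sigmoid derivative is at most 0.25 and shrinks rapidly as the input grows, while the ReLU derivative stays exactly 1 for any positive input.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([0.0, 2.0, 5.0, 10.0])        # example pre-activations
sig_grad = sigmoid(z) * (1 - sigmoid(z))   # sigmoid derivative, capped at 0.25
relu_grad = (z > 0).astype(float)          # ReLU derivative: 0 or 1
print sig_grad    # roughly [0.25, 0.105, 0.0066, 4.5e-05]
print relu_grad   # [0., 1., 1., 1.]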

Also, if by any chance you were using the ReLU derivative, you were missing an 'else' in your code: without it, the forward-pass code below would run anyway and simply overwrite the derivative.

Just as a refresher, here's the definition of ReLU:

f(x)=max(0,x)

... meaning it can blow your activation up to infinity. You want to avoid having ReLU on the last (output) layer, especially since your targets here are bounded in [0,1].
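If you still want to try ReLU here, one option is to use it only in the hidden layer and keep sigmoid on the output so l2 stays in [0,1]. A minimal sketch of the loop body, assuming the rest of your code stays unchanged; the explicit learning rate alpha is my own addition, since a smaller step often behaves better with ReLU than the implicit step of 1 in your loop:

# hidden layer: ReLU; output layer: sigmoid (keeps l2 bounded in [0,1])
l1 = relu(np.dot(l0, syn0))
l2 = nonlin(np.dot(l1, syn1))

l2_error = y - l2
l2_delta = l2_error * nonlin(l2, deriv=True)   # sigmoid derivative at the output

l1_error = l2_delta.dot(syn1.T)
l1_delta = l1_error * (l1 > 0)                 # ReLU derivative without overwriting l1,
                                               # which is still needed for the syn1 update

alpha = 0.1                                    # assumed learning rate, not in the original code
syn1 += alpha * l1.T.dot(l2_delta)
syn0 += alpha * l0.T.dot(l1_delta)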

On a side note, whenever possible you should take advantage of vectorised operations:

def relu(x, deriv=False):#RELU
    if (deriv == True):
        mask = x > 0
        x[mask] = 1         # note: this modifies x in place
        x[~mask] = 0
        return x            # without this return the function would return None
    else: # HERE YOU WERE MISSING "ELSE"
        return np.maximum(0,x)

Yes, it's much faster than the explicit if/else loops you were doing.
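As a rough illustration of the difference (timings are machine-dependent, and relu_loop below is just a stand-in for your loop version):

import numpy as np
import timeit

a = np.random.randn(500, 500)

def relu_loop(x):                 # loop version, similar in spirit to yours
    out = x.copy()
    for i in range(out.shape[0]):
        for k in range(out.shape[1]):
            if out[i][k] < 0:
                out[i][k] = 0
    return out

print timeit.timeit(lambda: np.maximum(0, a), number=10)   # vectorised
print timeit.timeit(lambda: relu_loop(a), number=10)       # explicit loops, much slower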
