Neural Network converging to zero output

I am trying to train this neural network to make predictions on some data. I tried it on a small dataset (around 100 records) and it worked like a charm. Then I plugged in the new dataset and found that the NN converges to an output of 0, while the error converges to approximately the ratio between the number of positive examples and the total number of examples.

My dataset is composed of yes/no features (1.0/0.0), and the ground truth is yes/no as well.

My suppositions:
1) there's a local minimum with output 0 (but I tried many values of the learning rate and initial weights, and it always seems to converge there)
2) my weight update is wrong (but it looks good to me)
3) it is just an output scaling problem. I tried to scale the output (i.e. output/max(output) and output/mean(output)), but the results are not good, as you can see in the code provided below. Should I scale it in a different way? Softmax?

Here is the code:

import pandas as pd
import numpy as np
import pickle
import random
from collections import defaultdict

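alpha = 0.1
N_ITER = 10
N_LAYERS = 10  # assumed: missing from this listing; the answer refers to 10 layers
# INIT_SCALE is used below, but its value is not shown in this listing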

train = pd.read_csv("./data/prediction.csv")

y = train['y_true'].as_matrix()
y = np.vstack(y).astype(float)
ytest = y[18000:]
y = y[:18000]

X = train.drop(['y_true'], axis = 1).as_matrix()
Xtest = X[18000:].astype(float)
X = X[:18000]

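def tanh(x,deriv=False):
    if deriv:
        return (1 - np.tanh(x)**2) * alpha
    return np.tanh(x)

def sigmoid(x,deriv=False):
    # with deriv=True, x is expected to be the sigmoid output itself
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

def relu(x,deriv=False):
    # leaky ReLU: slope 1 for x > 0, slope 0.01 otherwise
    if deriv:
        return 0.01 + 0.99*(x>0)
    return 0.01*x + 0.99*x*(x>0)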


syn = defaultdict(np.array)

for i in range(N_LAYERS-1):
    syn[i] = INIT_SCALE * np.random.random((len(X[0]),len(X[0]))) - INIT_SCALE/2
syn[N_LAYERS-1] = INIT_SCALE * np.random.random((len(X[0]),1)) - INIT_SCALE/2

l = defaultdict(np.array)
delta = defaultdict(np.array)

for j in xrange(N_ITER):
    l[0] = X
    for i in range(1,N_LAYERS+1):
        l[i] = relu(np.dot(l[i-1],syn[i-1]))

    error = (y - l[N_LAYERS])

    e = np.mean(np.abs(error))
    if (j% 1) == 0:
        print "\nIteration " + str(j) + " of " + str(N_ITER)
        print "Error: " + str(e)

    delta[N_LAYERS] = error*relu(l[N_LAYERS],deriv=True) * alpha
    for i in range(N_LAYERS-1,0,-1):
        error = delta[i+1].dot(syn[i].T)
        delta[i] = error*relu(l[i],deriv=True) * alpha

    for i in range(N_LAYERS):
        syn[i] += l[i].T.dot(delta[i+1])

pickle.dump(syn, open('neural_weights.pkl', 'wb'))

# TESTING with f1-measure

l[0] = Xtest
for i in range(1,N_LAYERS+1):
    l[i] = relu(np.dot(l[i-1],syn[i-1]))

out = l[N_LAYERS]/max(l[N_LAYERS])

tp = float(0)
fp = float(0)
fn = float(0)
tn = float(0)

for i in l[N_LAYERS][:50]:
    print i

for i in range(len(ytest)):
    if out[i] > 0.5 and ytest[i] == 1:
        tp += 1
    if out[i] <= 0.5 and ytest[i] == 1:
        fn += 1
    if out[i] > 0.5 and ytest[i] == 0:
        fp += 1
    if out[i] <= 0.5 and ytest[i] == 0:
        tn += 1

print "tp: " + str(tp)
print "fp: " + str(fp)
print "tn: " + str(tn)
print "fn: " + str(fn)

print "\nprecision: " + str(tp/(tp + fp))
print "recall: " + str(tp/(tp + fn))

f1 = 2 * tp /(2 * tp + fn + fp)
print "\nf1-measure:" + str(f1)

And this is the output:

Iteration 0 of 10
Error: 0.222500767998

Iteration 1 of 10
Error: 0.222500771157

Iteration 2 of 10
Error: 0.222500774321

Iteration 3 of 10
Error: 0.22250077749

Iteration 4 of 10
Error: 0.222500780663

Iteration 5 of 10
Error: 0.222500783841

Iteration 6 of 10
Error: 0.222500787024

Iteration 7 of 10
Error: 0.222500790212

Iteration 8 of 10
Error: 0.222500793405

Iteration 9 of 10
Error: 0.222500796602

[ 0.]
[ 0.]
[  5.58610895e-06]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[  4.62182626e-06]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[  5.58610895e-06]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[  4.62182626e-06]
[ 0.]
[ 0.]
[  5.04501079e-10]
[  5.58610895e-06]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[  5.04501079e-10]
[ 0.]
[ 0.]
[  4.62182626e-06]
[ 0.]
[  5.58610895e-06]
[ 0.]
[ 0.]
[ 0.]
[  5.58610895e-06]
[ 0.]
[ 0.]
[ 0.]
[  5.58610895e-06]
[ 0.]
[  1.31432294e-05]

tp: 28.0
fp: 119.0
tn: 5537.0
fn: 1550.0

precision: 0.190476190476
recall: 0.0177439797212
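
A quick sanity check on these numbers, reusing only the four counts printed above (the variable names are illustrative): a model that always outputs 0 has a mean absolute error equal to the fraction of positive examples. On this test split that fraction is about 0.218, close to the training error of about 0.2225 reported in the iterations.

```python
# Counts copied from the test output above.
tp, fp, tn, fn = 28.0, 119.0, 5537.0, 1550.0

total = tp + fp + tn + fn
# Mean absolute error of a predictor that always outputs 0
# equals the share of positive examples.
positive_rate = (tp + fn) / total

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fn + fp)

print("positive rate: %.4f" % positive_rate)
print("precision: %.4f, recall: %.4f, f1: %.4f" % (precision, recall, f1))
```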


Based on your model, it's unlikely that you need 10 layers for your network to converge.

Try a 3-layer network with more hidden nodes. For the majority of feedforward problems, you only need 1 hidden layer to converge effectively.
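
As a minimal sketch of what such a network could look like (input, one hidden layer, output), here is a NumPy version trained on made-up 0/1 data. Everything in it is an assumption for illustration, not taken from the question: the toy dataset, the hidden size of 16, the tanh hidden units, the sigmoid output with a cross-entropy gradient, and the function names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_hidden_layer(X, y, n_hidden=16, alpha=0.1, n_iter=5000, seed=0):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    W1 = rng.uniform(-0.5, 0.5, size=(n_features, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.5, 0.5, size=(n_hidden, 1))
    b2 = np.zeros(1)
    for _ in range(n_iter):
        # forward pass
        h = np.tanh(X.dot(W1) + b1)
        p = sigmoid(h.dot(W2) + b2)
        # backward pass: gradient w.r.t. the output pre-activation,
        # averaged over the samples
        d_out = (p - y) / n_samples
        d_hid = d_out.dot(W2.T) * (1.0 - h ** 2)
        # alpha is applied once, in the update itself
        W2 -= alpha * h.T.dot(d_out)
        b2 -= alpha * d_out.sum(axis=0)
        W1 -= alpha * X.T.dot(d_hid)
        b1 -= alpha * d_hid.sum(axis=0)
    return W1, b1, W2, b2

def predict_proba(X, params):
    W1, b1, W2, b2 = params
    return sigmoid(np.tanh(X.dot(W1) + b1).dot(W2) + b2)

# Toy data: 8 yes/no features; the label is "feature 0 AND feature 1",
# so roughly a quarter of the rows are positive.
rng = np.random.RandomState(1)
X_toy = rng.randint(0, 2, size=(500, 8)).astype(float)
y_toy = (X_toy[:, 0] * X_toy[:, 1]).reshape(-1, 1)

params = train_one_hidden_layer(X_toy, y_toy)
accuracy = np.mean((predict_proba(X_toy, params) > 0.5) == (y_toy == 1))
print("training accuracy: %.3f" % accuracy)
```

Because the sigmoid output already lies between 0 and 1, it can be thresholded at 0.5 directly, without dividing by max(output).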

Deep NNs are much more difficult to train than shallow ones.
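
One way to see the difficulty is to push random 0/1 inputs through 10 untrained layers and watch the activations shrink. This is only a rough illustration: the feature count (20) and the initialization scale (0.1) are hypothetical, since neither is shown in the question, and the leaky ReLU mirrors the relu function posted there.

```python
import numpy as np

def leaky_relu(z):
    # same shape as the relu in the question: slope 1 for z > 0, else 0.01
    return np.where(z > 0, z, 0.01 * z)

rng = np.random.RandomState(0)
n_features = 20    # hypothetical
init_scale = 0.1   # hypothetical
n_layers = 10

activations = rng.randint(0, 2, size=(100, n_features)).astype(float)
mean_abs = []
for layer in range(n_layers):
    # weights uniform in [-init_scale/2, init_scale/2], as in the question
    W = init_scale * rng.random_sample((n_features, n_features)) - init_scale / 2
    activations = leaky_relu(activations.dot(W))
    mean_abs.append(np.mean(np.abs(activations)))

for layer, value in enumerate(mean_abs, start=1):
    print("layer %2d: mean |activation| = %.3e" % (layer, value))
```

Under these assumed values the signal loses roughly an order of magnitude per layer, which would be consistent with outputs that are almost all 0 after 10 layers.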

Like others have said, your learning rate should be much smaller ([.01, .3] is a decent range). Additionally, the number of iterations needs to be much greater.
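
On the learning rate, one detail in the posted loop may matter: alpha is multiplied into delta at every layer during backpropagation, so the factor compounds with depth. A small arithmetic sketch, assuming N_LAYERS = 10 and alpha = 0.1 as in the question (the dictionary names are hypothetical):

```python
alpha = 0.1
N_LAYERS = 10

# syn[i] is updated with delta[i + 1], and each delta is multiplied by alpha
# once more than the delta of the layer after it. So delta[i + 1] carries
# alpha to the power (N_LAYERS - i).
alpha_power = {i: N_LAYERS - i for i in range(N_LAYERS)}
effective_rate = {i: alpha ** alpha_power[i] for i in range(N_LAYERS)}

for i in range(N_LAYERS):
    print("syn[%d]: alpha^%d = %.0e" % (i, alpha_power[i], effective_rate[i]))
```

If that reading of the loop is right, the weights nearest the input barely move at all, so applying alpha once, in the weight update, is probably what was intended.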

10 layers is way too many.

