
Stuck implementing simple neural network

I've been bashing my head against this brick wall for what seems like an eternity, and I just can't seem to wrap my head around it. I'm trying to implement an autoencoder using only numpy and matrix multiplication. No theano or keras tricks allowed.

I'll describe the problem and all its details. It is a bit complex at first since there are a lot of variables, but it really is quite straightforward.

What we know

1) X is an m by n matrix, which holds our inputs. The inputs are the rows of this matrix: each input is an n-dimensional row vector, and we have m of them.

2) The number of neurons in our (single) hidden layer, which is k.

3) The activation function of our neurons (the sigmoid, which will be denoted g(x)) and its derivative g'(x).

What we don't know and want to find

Overall our goal is to find 6 matrices: w1 which is n by k, b1 which is m by k, w2 which is k by n, b2 which is m by n, w3 which is n by n and b3 which is m by n.

They are initialized randomly, and we find the best solution using gradient descent.
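
To make "gradient descent" concrete, one update step looks like the minimal sketch below; the same rule is applied to each of the six matrices once the gradient of the cost with respect to it is known. Here lr is a hypothetical learning rate, and the two random arrays are stand-ins for a parameter and its gradient (the actual gradients are derived further down).

import numpy as np

lr = 0.01                        #hypothetical learning rate
param = np.random.rand(4, 3)     #stand-in for any one of w1, b1, w2, b2, w3, b3
dSdparam = np.random.rand(4, 3)  #stand-in for the gradient of the cost with respect to that parameter
param -= lr * dSdparam           #step against the gradient to decrease the cost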

The process

The entire process looks something like this: [network diagram]

First we compute z1 = Xw1+b1. It is m by k and is the input to our hidden layer. We then compute h1 = g(z1), which simply applies the sigmoid function to all elements of z1. Naturally it is also m by k and is the output of our hidden layer.

We then compute z2 = h1w2+b2, which is m by n and is the input to the output layer of our neural network. Then we compute h2 = g(z2), which again is naturally also m by n and is the output of our neural network.

Finally, we take this output and apply one more linear transformation to it: Xhat = h2w3+b3, which is also m by n and is our final result.

Where I am stuck

The cost function I want to minimize is the mean squared error. I have already implemented it in numpy:

def cost(x, xhat):
    return (1.0/(2 * m)) * np.trace(np.dot(x-xhat,(x-xhat).T))
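
As a side note, the trace form materializes an m by m matrix via np.dot(x-xhat, (x-xhat).T). Since the trace of np.dot(A, A.T) is just the sum of the squared entries of A, an equivalent (and cheaper) way to write the same cost is:

def cost(x, xhat):
    return (1.0/(2 * m)) * np.sum((x - xhat)**2)  #identical value to the trace version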

The problem is finding the derivatives of the cost with respect to w1, b1, w2, b2, w3, b3. Let's call the cost S.

After deriving the gradients myself and checking them numerically (a finite-difference check along the lines of the sketch after this list), I have established the following facts:

1) dSdxhat = (1.0/m) * (xhat - x)

2) dSdw3 = np.dot(h2.T, dSdxhat)

3) dSdb3 = dSdxhat

4) dSdh2 = np.dot(dSdxhat, w3.T)
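
For reference, the numerical check mentioned above uses central finite differences. A minimal sketch is below; numerical_grad is a hypothetical helper name, it reuses the cost function above, and forward stands for a function that recomputes xhat from the current parameters (the feed-forward code is in the edit below).

def numerical_grad(param, forward, x, eps=1e-6):
    #central finite-difference estimate of dS/dparam, one entry at a time
    grad = np.zeros_like(param)
    it = np.nditer(param, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        orig = param[idx]
        param[idx] = orig + eps
        c_plus = cost(x, forward())   #cost with this entry nudged up
        param[idx] = orig - eps
        c_minus = cost(x, forward())  #cost with this entry nudged down
        param[idx] = orig             #restore the original value
        grad[idx] = (c_plus - c_minus) / (2 * eps)
    return grad

#hypothetical usage, e.g. for w3:
#forward = lambda: np.dot(g(np.dot(g(np.dot(x, w1) + b1), w2) + b2), w3) + b3
#print(np.allclose(numerical_grad(w3, forward, x), dSdw3, atol=1e-5))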

But I can't for the life of me figure out dSdz2. It's a brick wall.

From the chain rule, it should be that dSdz2 = dSdh2 * dh2dz2, but the dimensions don't match.

What is the formula to compute the derivative of S with respect to z2?

Edit - This is my code for the entire feed-forward operation of the autoencoder.

import numpy as np

def g(x): #sigmoid activation function
    return 1/(1+np.exp(-x)) #same shape as x!

def gGradient(x): #gradient of sigmoid
    return g(x)*(1-g(x)) #same shape as x!

def cost(x, xhat): #mean squared error between x the data and xhat the output of the machine
    return (1.0/(2 * m)) * np.trace(np.dot(x-xhat,(x-xhat).T))

#Just small random numbers so we can test that it's working small scale
m = 5 #num of examples
n = 2 #num of features in each example
k = 2 #num of neurons in the hidden layer of the autoencoder
x = np.random.rand(m, n) #the data, shape (m, n)

w1 = np.random.rand(n, k) #weights from input layer to hidden layer, shape (n, k)
b1 = np.random.rand(m, k) #bias term from input layer to hidden layer (m, k)
z1 = np.dot(x,w1)+b1 #input to the hidden layer, shape (m, k)
h1 = g(z1) #output of the hidden layer, shape (m, k)

w2 = np.random.rand(k, n) #weights from hidden layer to output layer of the autoencoder, shape (k, n)
b2 = np.random.rand(m, n) #bias term from hidden layer to output layer of autoencoder, shape (m, n)
z2 = np.dot(h1, w2)+b2 #input to the output layer of the autoencoder, shape (m, n)
h2 = g(z2) #Output of the entire autoencoder. The output layer of the autoencoder. shape (m, n)

w3 = np.random.rand(n, n) #weights from output layer of autoencoder to entire output of the machine, shape (n, n)
b3 = np.random.rand(m, n) #bias term from output layer of autoencoder to entire output of the machine, shape (m, n)
xhat = np.dot(h2, w3)+b3 #the output of the machine, which hopefully resembles the original data x, shape (m, n)

OK, here's a suggestion. In the vector case, if you have x as a vector of length n, then g(x) is also a vector of length n. However, g'(x) is not a vector, it's the Jacobian matrix, and will be of size n X n. Similarly, in the minibatch case, where X is a matrix of size m X n, g(X) is m X n but g'(X) is n X n. Try:

def gGradient(x): #gradient of sigmoid
    return np.dot(g(x).T, 1 - g(x))

@Paul is right that the bias terms should be vectors, not matrices. You should have:

b1 = np.random.rand(k) #bias term from input layer to hidden layer (k,)
b2 = np.random.rand(n) #bias term from hidden layer to output layer of autoencoder, shape (n,)
b3 = np.random.rand(n) #bias term from output layer of autoencoder to entire output of the machine, shape (n,)

Numpy's broadcasting means that you don't have to change your calculation of xhat.
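
If it helps, a quick check of that broadcasting: adding a bias of shape (n,) to an (m, n) matrix adds it to every row, so np.dot(h2, w3) + b3 keeps its (m, n) shape.

demo = np.zeros((m, n)) + b3   #b3 has shape (n,) after the change above
print(demo.shape)              #(m, n): b3 was added to each of the m rows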

Then (I think!) you can compute the derivatives like this:

dSdxhat = (1/float(m)) * (xhat-x)
dSdw3 = np.dot(h2.T,dSdxhat)
dSdb3 = dSdxhat.mean(axis=0)
dSdh2 = np.dot(dSdxhat, w3.T)
dSdz2 = np.dot(dSdh2, gGradient(z2))
dSdb2 = dSdz2.mean(axis=0)
dSdw2 = np.dot(h1.T,dSdz2)
dSdh1 = np.dot(dSdz2, w2.T)
dSdz1 = np.dot(dSdh1, gGradient(z1))
dSdb1 = dSdz1.mean(axis=0)
dSdw1 = np.dot(x.T,dSdz1)

Does this work for you?

Edit

I've decided that I'm not at all sure that gGradient is supposed to be a matrix. How about:

dSdxhat = (xhat-x) / m
dSdw3 = np.dot(h2.T,dSdxhat)
dSdb3 = dSdxhat.sum(axis=0)
dSdh2 = np.dot(dSdxhat, w3.T)
dSdz2 = h2 * (1-h2) * dSdh2
dSdb2 = dSdz2.sum(axis=0)
dSdw2 = np.dot(h1.T,dSdz2)
dSdh1 = np.dot(dSdz2, w2.T)
dSdz1 = h1 * (1-h1) * dSdh1
dSdb1 = dSdz1.sum(axis=0)
dSdw1 = np.dot(x.T,dSdz1)
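
For what it's worth, one way to sanity-check these formulas is to run plain gradient descent with them and watch the cost fall. A minimal sketch, assuming a hypothetical learning rate lr and the vector-shaped biases from above:

lr = 0.1                             #hypothetical learning rate
for step in range(2000):
    #forward pass, same as in the question (the (n,) biases broadcast over rows)
    z1 = np.dot(x, w1) + b1
    h1 = g(z1)
    z2 = np.dot(h1, w2) + b2
    h2 = g(z2)
    xhat = np.dot(h2, w3) + b3
    if step % 500 == 0:
        print(step, cost(x, xhat))   #should decrease as training proceeds

    #backward pass: the gradients derived above
    dSdxhat = (xhat - x) / m
    dSdw3 = np.dot(h2.T, dSdxhat)
    dSdb3 = dSdxhat.sum(axis=0)
    dSdh2 = np.dot(dSdxhat, w3.T)
    dSdz2 = h2 * (1 - h2) * dSdh2
    dSdb2 = dSdz2.sum(axis=0)
    dSdw2 = np.dot(h1.T, dSdz2)
    dSdh1 = np.dot(dSdz2, w2.T)
    dSdz1 = h1 * (1 - h1) * dSdh1
    dSdb1 = dSdz1.sum(axis=0)
    dSdw1 = np.dot(x.T, dSdz1)

    #gradient-descent update of all six parameter matrices
    w1 -= lr * dSdw1
    b1 -= lr * dSdb1
    w2 -= lr * dSdw2
    b2 -= lr * dSdb2
    w3 -= lr * dSdw3
    b3 -= lr * dSdb3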
