Why did we use the sum in the code for the gradient of the bias, but not in the code for the gradient of the weights?
The code for the partial derivatives of the mean squared error:
w_grad = -(2 / n_samples)*(X.T.dot(y_true - y_pred))
b_grad = -(2 / n_samples)*np.sum(y_true - y_pred)
with n_samples as n, the number of samples, y_true as the observations, and y_pred as the predictions.
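To make the shapes concrete, here is a minimal runnable sketch of the two gradient lines above (the toy data and shapes are assumptions made only for illustration), with the per-feature sum that X.T.dot(...) performs spelled out explicitly:

import numpy as np

# Toy data: 4 samples, 3 features (values and shapes are illustrative assumptions).
X = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.],
              [1., 0., 1.]])
y_true = np.array([1., 2., 3., 4.])
y_pred = np.array([0.5, 2.5, 2.0, 4.5])
n_samples = X.shape[0]

residual = y_true - y_pred                      # shape (n_samples,)

# X.T.dot(residual) multiplies and sums over the samples for every feature
# at once, so the sum over samples is already inside the dot product.
w_grad = -(2 / n_samples) * X.T.dot(residual)   # shape (n_features,)

# The bias has no feature column, so its sum over samples is written explicitly.
b_grad = -(2 / n_samples) * np.sum(residual)    # scalar

# The same w_grad with the hidden per-feature sum made explicit:
w_grad_explicit = -(2 / n_samples) * np.array(
    [np.sum(residual * X[:, i]) for i in range(X.shape[1])])
assert np.allclose(w_grad, w_grad_explicit)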
My question is: why did we use the sum for the gradient in the code for b (b_grad), but not in the code for w_grad?
The original equation is:
\vec{y} = \sum_{i=0}^{n} (w_i x_i + b_i)
If you have ten features, then you have ten Ws and ten Bs, and the total number of variables is twenty.
But we can sum all B_i into one variable, so the total number of variables becomes 10 + 1 = 11. This is done by adding one more dimension and fixing the last x to 1. The calculation becomes:
\vec{y} = \begin{bmatrix} w_0 & \hdots & w_i & w_b \end{bmatrix} \begin{bmatrix} x_0 \\ \vdots \\ x_i \\ 1 \end{bmatrix}
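A small numpy sketch of this trick (X_aug and w_aug are hypothetical names used only here): append a column of ones to X so the bias becomes the last weight, and a single matrix expression then produces the weight gradients and the bias gradient (which reduces to the plain sum) together:

import numpy as np

# Toy data, shapes chosen only for illustration.
X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
y_true = np.array([1., 2., 3.])
n_samples = X.shape[0]

w = np.array([0.1, 0.2])   # weights for the two features
b = 0.5                    # single bias

# Fold the bias into the weight vector by fixing an extra "feature" at 1.
X_aug = np.hstack([X, np.ones((n_samples, 1))])   # last column is all ones
w_aug = np.append(w, b)                           # last entry is the bias

y_pred = X_aug.dot(w_aug)                         # same as X.dot(w) + b
residual = y_true - y_pred

# One expression now yields both gradients: the last entry of grad_aug
# is X_aug[:, -1].dot(residual) = np.sum(residual), i.e. the bias gradient.
grad_aug = -(2 / n_samples) * X_aug.T.dot(residual)

w_grad = -(2 / n_samples) * X.T.dot(residual)
b_grad = -(2 / n_samples) * np.sum(residual)
assert np.allclose(grad_aug[:-1], w_grad)
assert np.isclose(grad_aug[-1], b_grad)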