
Why do we use the sum in the code for the gradient of the bias, but not in the code for the gradient of the weights?

The code for the partial derivatives of the mean squared error:

w_grad = -(2 / n_samples) * (X.T.dot(y_true - y_pred))  # vector: one entry per weight
b_grad = -(2 / n_samples) * np.sum(y_true - y_pred)      # scalar: a single bias

Here n_samples is n, the number of samples, y_true is the vector of observations, and y_pred is the vector of predictions.
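For reference, a minimal runnable sketch of this setup (the toy data, learning rate, and training loop are assumptions added for illustration; only the two gradient lines come from the snippet above):

import numpy as np

# Toy data, shapes assumed: X is (n_samples, n_features), y_true is (n_samples,)
rng = np.random.default_rng(0)
n_samples, n_features = 100, 3
X = rng.normal(size=(n_samples, n_features))
true_w, true_b = np.array([1.5, -2.0, 0.5]), 4.0
y_true = X @ true_w + true_b + rng.normal(scale=0.1, size=n_samples)

w, b, lr = np.zeros(n_features), 0.0, 0.1
for _ in range(500):
    y_pred = X @ w + b
    w_grad = -(2 / n_samples) * (X.T.dot(y_true - y_pred))  # one entry per weight
    b_grad = -(2 / n_samples) * np.sum(y_true - y_pred)     # a single scalar
    w -= lr * w_grad
    b -= lr * b_grad

print(w, b)  # approaches true_w and true_b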

My question is: why do we use the sum for the gradient in the code for b (b_grad), but not in the code for w_grad?

The original equation is:

y = Sum( W_i*x_i + B_i )

\vec{y} = \sum_{i=0}^{n} (w_i x_i + b_i)

If you have ten features, then you have ten Ws and ten Bs, and the total number of variables is twenty.

But we can just sum all the B_i into one variable, so the total number of variables becomes 10 + 1 = 11. This is done by adding one more dimension and fixing the last x to be 1. The calculation becomes:

\vec{y} = \begin{bmatrix} w_0 \\ \vdots \\ w_i \\ w_b \end{bmatrix}^{\top} \begin{bmatrix} x_0 \\ \vdots \\ x_i \\ 1 \end{bmatrix}
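A minimal NumPy sketch of this trick (the toy data and variable names are assumptions added for illustration, not part of the original question): append a column of ones to X and fold b into the weight vector. A single matrix product then produces every gradient, and the entry belonging to the ones column reduces to the plain sum that appears in b_grad; for the weights no explicit np.sum is needed because X.T.dot(...) already sums over the samples for each feature.

import numpy as np

# Toy data, shapes assumed: X is (n_samples, n_features), y_true is (n_samples,)
rng = np.random.default_rng(1)
n_samples, n_features = 5, 3
X = rng.normal(size=(n_samples, n_features))
y_true = rng.normal(size=n_samples)
w, b = rng.normal(size=n_features), 0.7
y_pred = X @ w + b

# Separate gradients, as in the question:
w_grad = -(2 / n_samples) * X.T.dot(y_true - y_pred)
b_grad = -(2 / n_samples) * np.sum(y_true - y_pred)

# Bias trick: fix the last feature to 1 so the bias is just one more weight.
X_aug = np.hstack([X, np.ones((n_samples, 1))])
grad_aug = -(2 / n_samples) * X_aug.T.dot(y_true - y_pred)

# The ones column turns its row of the product into a plain sum over samples,
# so grad_aug[:-1] matches w_grad and grad_aug[-1] matches b_grad.
print(np.allclose(grad_aug[:-1], w_grad), np.allclose(grad_aug[-1], b_grad))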


