
Ridge regression using stochastic gradient descent in Python

I am trying to implement a solution to Ridge regression in Python using Stochastic gradient descent as the solver. My code for SGD is as follows:

def fit(self, X, Y):
    # Convert to a DataFrame in case X is a numpy matrix (assumes pandas as pd and numpy as np are imported)
    X = pd.DataFrame(X)

    # Define a function to calculate the error given a weight vector beta and a training example xi, yi

    # Prepend a column of 1s to the data for the intercept
    X.insert(0, 'intercept', np.array([1.0]*X.shape[0]))

    # Find dimensions of train
    m, d = X.shape

    # Initialize weights to random
    beta = self.initializeRandomWeights(d)
    beta_prev = None

    epochs = 0
    prev_error = None
    while (beta_prev is None or epochs < self.nb_epochs):
        print("## Epoch: " + str(epochs))
        indices = list(range(m))    # list() so shuffle() can permute it in place (Python 3)
        shuffle(indices)            # requires: from random import shuffle
        for i in indices:   # Pick a training example from a randomly shuffled set
            beta_prev = beta
            xi = X.iloc[i]
            errori = sum(beta*xi) - Y[i]    # Error[i] = sum(beta*x) - y = error of ith training example
            gradient_vector = xi*errori + self.l*beta_prev
            beta = beta_prev - self.alpha*gradient_vector
        epochs += 1

The data I'm testing this on is not normalized, and my implementation always ends up with all the weights being infinity, even though I initialize the weight vector to low values. Only when I set the learning rate alpha to a very small value (~1e-8) does the algorithm end up with valid values in the weight vector.

My understanding is that normalizing/scaling the input features only helps reduce convergence time. But the algorithm should not fail to converge altogether if the features are not normalized. Is my understanding correct?

You can see in scikit-learn's Stochastic Gradient Descent documentation that one of the disadvantages of the algorithm is that it is sensitive to feature scaling. In general, gradient-based optimization algorithms converge faster on normalized data.

Also, normalization is advantageous for regression methods.

The updates to the coefficients during each step will depend on the range of each feature. Also, the regularization term will be affected heavily by large feature values.
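
For illustration, here is a minimal sketch of standardizing the features before running the SGD solver, using scikit-learn's StandardScaler (the random data and the commented-out fit call are made up for the example, not taken from the question):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three features on wildly different scales.
rng = np.random.RandomState(0)
X = rng.rand(100, 3) * np.array([1.0, 1e3, 1e6])
Y = X @ np.array([2.0, 0.5, 1e-4]) + rng.randn(100)

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# model.fit(X_scaled, Y)   # the asker's SGD ridge solver, now run on scaled data

With features on comparable scales, a single learning rate works for all coordinates and the ridge penalty treats all coefficients comparably.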

SGD may converge without data normalization, but that depends on the data at hand. Therefore, your assumption is not correct.

Your assumption is not correct.

It's hard to answer this, because there are so many different methods/environments, but I will try to mention some points.

Normalization

  • When some method is not scale-invariant (I think every linear regression is not), you really should normalize your data
    • I take it that you are just ignoring this because of debugging/analyzing
  • Normalizing your data is not only relevant for convergence time; the results will differ too (think about the effect within the loss function: big values may contribute much more to the loss than small ones)!

Convergence

  • There is probably much to tell about the convergence of many methods on normalized/non-normalized data, but your case is special:
    • SGD's convergence theory only guarantees convergence to some local minimum (= global minimum in your convex-optimization problem) for some choices of hyper-parameters (learning rate and learning schedule/decay; a small decay sketch follows this list)
    • Even optimizing normalized data can fail with SGD when those params are bad!
      • This is one of the most important downsides of SGD: dependence on hyper-parameters
    • As SGD is based on gradients and step sizes, non-normalized data can have a huge effect on not achieving this convergence!
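
As a hedged illustration of a learning schedule (a common choice, not something from the original post), the step size can be decayed over epochs, e.g. with inverse-time decay:

def decayed_alpha(alpha0, decay, epoch):
    # Inverse-time decay; alpha0 and decay are hyper-parameters that
    # still have to be tuned for the data at hand.
    return alpha0 / (1.0 + decay * epoch)

# Inside the asker's training loop this could replace the fixed self.alpha:
# alpha_t = decayed_alpha(self.alpha, 0.01, epochs)
# beta = beta_prev - alpha_t * gradient_vector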

In order for SGD to converge in linear regression, the step size should be smaller than 2/s, where s is the largest singular value of the matrix (see the "Convergence and stability in the mean" section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter ); in the case of ridge regression it should be less than 2*(1+p/s^2)/s, where p is the ridge penalty.
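
As a small sketch, this bound can be computed directly from the singular values of the (intercept-augmented) data matrix; both expressions are quoted from the statement above as given, not re-derived here:

import numpy as np

def max_step_size(X, ridge_penalty=0.0):
    # Largest singular value of the data matrix, via a full SVD for clarity.
    s = np.linalg.svd(X, compute_uv=False)[0]
    if ridge_penalty == 0.0:
        return 2.0 / s                                # plain-regression bound from above
    return 2.0 * (1.0 + ridge_penalty / s**2) / s     # ridge bound as stated above

# A conservative learning rate would then be some fraction of this value,
# e.g. alpha = 0.5 * max_step_size(X, ridge_penalty=l).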

Normalizing the rows of the matrix (or the gradients) changes the loss function to give each sample an equal weight, and it changes the singular values of the matrix such that you can choose a step size near 1 (see the NLMS section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter ). Depending on your data, it might require smaller step sizes or allow for larger ones. It all depends on whether the normalization increases or decreases the largest singular value of the matrix.
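
A hedged sketch of the NLMS-style update described above: each sample's gradient step is divided by that sample's squared norm, so a step size near 1 can be used (the small eps guard is an addition for numerical safety, not part of the original answer):

import numpy as np

def nlms_step(beta, xi, yi, mu=1.0, eps=1e-12):
    # Normalized LMS update for one sample (xi, yi): the step is divided by
    # the sample's squared norm, so its size no longer depends on the row's scale.
    error = xi @ beta - yi
    return beta - (mu / (eps + xi @ xi)) * error * xi

# The ridge penalty term from the question (l * beta) could be added to the
# per-sample gradient in the same way before scaling, if regularization is needed.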

Note that when deciding whether or not to normalize the rows, you shouldn't just think about the convergence rate (which is determined by the ratio between the largest and smallest singular values) or stability in the mean, but also about how it changes the loss function and whether or not that fits your needs. Usually it makes sense to normalize, but sometimes (for example, when you want to give different importance to different samples) it doesn't.
