What determines whether my Python gradient descent algorithm converges?

I've implemented a single-variable linear regression model in Python that uses gradient descent to find the intercept and slope of the best-fit line (I'm using gradient descent rather than computing the optimal values for intercept and slope directly because I'd eventually like to generalize to multiple regression).

The data I am using are below. sales is the dependent variable (in dollars) and temp is the independent variable (degrees Celsius) (think ice cream sales vs. temperature, or something similar).

sales   temp
215     14.20
325     16.40
185     11.90
332     15.20
406     18.50
522     22.10
412     19.40
614     25.10
544     23.40
421     18.10
445     22.60
408     17.20

And this is my data after it has been normalized:

sales        temp 
0.06993007  0.174242424
0.326340326 0.340909091
0           0
0.342657343 0.25
0.515151515 0.5
0.785547786 0.772727273
0.529137529 0.568181818
1           1
0.836829837 0.871212121
0.55011655  0.46969697
0.606060606 0.810606061
0.51981352  0.401515152

My code for the algorithm:

import numpy as np
import pandas as pd
from scipy import stats

class SLRegression(object):
    def __init__(self, learnrate = .01, tolerance = .000000001, max_iter = 10000):

        # Initialize learnrate, tolerance, and max_iter.
        self.learnrate = learnrate
        self.tolerance = tolerance
        self.max_iter = max_iter

    # Define the gradient descent algorithm.
    def fit(self, data):
        # data   :   array-like, shape = [m_observations, 2_columns] 

        # Initialize local variables.
        converged = False
        m = data.shape[0]

        # Track number of iterations.
        self.iter_ = 0

        # Initialize theta0 and theta1.
        self.theta0_ = 0
        self.theta1_ = 0

        # Compute the cost function.
        J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
        print('J is: ', J)

        # Iterate over each point in data and update theta0 and theta1 on each pass.
        while not converged:
            diftemp0 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) for i in range(m)])
            diftemp1 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) * data[i][1] for i in range(m)])

            # Subtract the learnrate * partial derivative from theta0 and theta1.
            temp0 = self.theta0_ - (self.learnrate * diftemp0)
            temp1 = self.theta1_ - (self.learnrate * diftemp1)

            # Update theta0 and theta1.
            self.theta0_ = temp0
            self.theta1_ = temp1

            # Compute the updated cost function, given new theta0 and theta1.
            new_J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
            print('New J is: %s' % new_J)

            # Test for convergence.
            if abs(J - new_J) <= self.tolerance:
                converged = True
                print('Model converged after %s iterations!' % self.iter_)

            # Set old cost equal to new cost and update iter.
            J = new_J
            self.iter_ += 1

            # Test whether we have hit max_iter.
            if self.iter_ == self.max_iter:
                converged = True
                print('Maximum iterations have been reached!')

        return self

    def point_forecast(self, x):
        # Given feature value x, returns the regression's predicted value for y.
        return self.theta0_ + self.theta1_ * x


# Run the algorithm on a data set.
if __name__ == '__main__':
    # Load in the .csv file.
    data = np.squeeze(np.array(pd.read_csv('sales_normalized.csv')))

    # Create a regression model with the default learning rate, tolerance, and maximum number of iterations.
    slregression = SLRegression()

    # Call the fit function and pass in the data.
    slregression.fit(data)

    # Print out the results.
    print('After %s iterations, the model converged on Theta0 = %s and Theta1 = %s.' % (slregression.iter_, slregression.theta0_, slregression.theta1_))
    # Compare our model to scipy linregress model.
    slope, intercept, r_value, p_value, slope_std_error = stats.linregress(data[:,1], data[:,0])
    print('Scipy linear regression gives intercept: %s and slope = %s.' % (intercept, slope))

    # Test the model with a point forecast.
    print('As an example, our algorithm gives y = %s given x = .87.' % slregression.point_forecast(.87)) # Should be about .83.
    print('The true y-value for x = .87 is about .8368.')

I'm having trouble understanding exactly what allows the algorithm to converge versus return values that are completely wrong. Given learnrate = .01, tolerance = .0000000001, and max_iter = 10000, in combination with normalized data, I can get the gradient descent algorithm to converge. However, when I use the un-normalized data, the smallest I can make the learning rate without the algorithm returning NaN is .005. This brings changes in the cost function from iteration to iteration down to around 614, but I can't get it to go any lower.

Is it definitely a requirement of this type of algorithm to have normalized data, and if so, why? Also, what would be the best way to take a novel x-value in non-normalized form and plug it into the point forecast, given that the algorithm needs normalized values? For instance, if I were going to deliver this algorithm to a client so they could make predictions of their own (I'm not, but for the sake of argument...), wouldn't I want them to simply be able to plug in the un-normalized x-value?

All in all, playing around with the tolerance, max_iter, and learnrate gives me non-convergent results the majority of the time. Is this normal, or are there flaws in my algorithm that are contributing to this issue?

Given learnrate = .01, tolerance = .0000000001, and max_iter = 10000, in combination with normalized data, I can get the gradient descent algorithm to converge. However, when I use the un-normalized data, the smallest I can make the learning rate without the algorithm returning NaN is .005.

That's to be expected, given the way you have your algorithm set up.

Normalizing the data makes the y-intercept of the best fit land around 0.0. Otherwise, the true y-intercept could be thousands of units away from your starting guess, and you'd have to trek all the way there before you ever really started the optimization part.
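To make the scale problem concrete, here is a small check (not part of the original code) that computes the very first gradient step on the raw data, using the same partial derivatives as the fit method:

import numpy as np

# Raw data from the question: column 0 is sales (y), column 1 is temp (x).
raw = np.array([[215, 14.2], [325, 16.4], [185, 11.9], [332, 15.2],
                [406, 18.5], [522, 22.1], [412, 19.4], [614, 25.1],
                [544, 23.4], [421, 18.1], [445, 22.6], [408, 17.2]])

y, x = raw[:, 0], raw[:, 1]

# Partial derivatives at the starting point theta0 = theta1 = 0,
# matching diftemp0 and diftemp1 in fit():
g0 = np.mean(0 + 0 * x - y)        # about -402
g1 = np.mean((0 + 0 * x - y) * x)  # about -7959

# With learnrate = .01, the first update sends theta1 to roughly 80,
# far past the true slope of about 30, so each subsequent step
# overshoots worse and the cost overflows to NaN.
print(g0, g1, 0 - .01 * g1)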

Is it definitely a requirement of this type of algorithm to have normalized data, and if so, why?

No, absolutely not, but if you don't normalize, you should pick a starting point more intelligently (you're starting at (m, b) = (0, 0)). Your learnrate may also be too small if you don't normalize your data, and the same goes for your tolerance.
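For instance, a rough data-driven starting guess might look like the sketch below (the initial_guess helper is hypothetical, not part of the question's class):

import numpy as np

def initial_guess(data):
    # data[:, 0] is y (sales), data[:, 1] is x (temp), as in fit().
    x, y = data[:, 1], data[:, 0]
    theta1 = (y.max() - y.min()) / (x.max() - x.min())  # crude rise-over-run slope
    theta0 = y.mean() - theta1 * x.mean()               # intercept through the means
    return theta0, theta1

Seeding self.theta0_ and self.theta1_ this way puts the search near the region of interest even for un-normalized data.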

Also, what would be the best way to take a novel x-value in non-normalized form and plug it into the point forecast, given that the algorithm needs normalized values?

Apply whatever transformation you applied to the original data to get the normalized data to your new x-value as well. (The code for the normalization is outside of what you have shown.) If this test point fell within the (min_x, max_x) range of your original data, then once transformed it should fall within 0 <= x <= 1. Once you have this normalized test point, plug it into your theta equation of a line (remember, your thetas are the m and b of the slope-intercept form of the equation of a line).
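Since the normalization code wasn't shown, here is a sketch assuming plain min-max scaling, with the min/max constants taken from the data in the question:

def normalize_x(x_new, x_min=11.90, x_max=25.10):
    # Scale a raw temperature into the [0, 1] range the model was trained on.
    return (x_new - x_min) / (x_max - x_min)

def denormalize_y(y_scaled, y_min=185.0, y_max=614.0):
    # Map a normalized prediction back into dollars.
    return y_scaled * (y_max - y_min) + y_min

# Example: a client plugs in a raw temperature of 21.0 degrees Celsius.
# y_dollars = denormalize_y(slregression.point_forecast(normalize_x(21.0)))

The important point is that x_min, x_max, y_min, and y_max come from the original training data, not from the new input.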

All in all, playing around with the tolerance, max_iter, and learnrate gives me non-convergent results the majority of the time.

For a well-formed problem, if you're in fact diverging, it usually means your step size is too large. Try lowering it.

If it's simply not converging before it hits the maximum number of iterations, that could be a few issues:

  • Your step size is too small,
  • Your tolerance is too small,
  • Your maximum number of iterations is too small,
  • Your starting point is poorly chosen.

In your case, using the non-normalized data results in your starting point of (0, 0) being very far off: the intercept and slope of the fit to the non-normalized data are around -159 and 30, while for your normalized data they are about 0.10 and 0.79. So most, if not all, of your iterations are being used just getting to the area of interest.

The problem with this is that increasing the step size to get to the area of interest faster also makes it less likely to find convergence once it gets there.

To account for this, some gradient descent algorithms use a dynamic step size (or learnrate), such that large steps are taken at the beginning and smaller ones as it nears convergence.
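As a sketch (the decay schedule and constant here are arbitrary choices, not from the original code), the update inside the while loop could use a shrinking rate:

def decayed_learnrate(initial_rate, iteration, decay=0.001):
    # Large steps early to cross the flat approach to the minimum,
    # smaller steps later so the iterates settle instead of overshooting.
    return initial_rate / (1.0 + decay * iteration)

# Inside fit()'s while loop, in place of self.learnrate:
#   rate = decayed_learnrate(self.learnrate, self.iter_)
#   temp0 = self.theta0_ - rate * diftemp0
#   temp1 = self.theta1_ - rate * diftemp1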


It may also be helpful for you to keep a history of the theta pairs throughout the algorithm and then plot them. You'll immediately be able to see the difference between using normalized and non-normalized input data.
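A minimal sketch of that bookkeeping (the history_ attribute and the plotting helper are additions, not part of the class above):

import matplotlib.pyplot as plt

# In fit(), before the while loop:        self.history_ = []
# In the loop, after updating the thetas: self.history_.append((self.theta0_, self.theta1_))

def plot_path(history):
    # Trace the (theta0, theta1) pairs the descent visited.
    t0, t1 = zip(*history)
    plt.plot(t0, t1, marker='.')
    plt.xlabel('theta0 (intercept)')
    plt.ylabel('theta1 (slope)')
    plt.title('Gradient descent path')
    plt.show()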
