Why does simple gradient descent diverge?

This is my second attempt at implementing gradient descent in one variable and it always diverges. Any ideas?

This is simple linear regression for minimizing the residual sum of squares in one variable.
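
For reference, the quantity I'm minimizing and its gradient are below; the code steps along the negative gradient direction (up to the constant factor of 2):

RSS(m, b) = \sum_i \bigl(y_i - (m x_i + b)\bigr)^2

\frac{\partial RSS}{\partial b} = -2 \sum_i \bigl(y_i - (m x_i + b)\bigr) = -2 \cdot \text{residuals\_sum}

\frac{\partial RSS}{\partial m} = -2 \sum_i x_i \bigl(y_i - (m x_i + b)\bigr) = -2 \cdot \text{residuals\_times\_xvalues\_sum}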

def gradient_descent_wtf(xvalues, yvalues):
    tolerance = 0.1

    #y=mx+b
    #some line to predict y values from x values
    m=1.
    b=1.

    #a predicted y-value has value mx + b

    for i in range(0,10):

        #calculate y-value predictions for all x-values
        predicted_yvalues = list()
        for x in xvalues:
            predicted_yvalues.append(m*x + b)

        # predicted_yvalues holds the predicted y-values

        #now calculate the residuals = y-value - predicted y-value for each point
        residuals = list()
        number_of_points = len(yvalues)
        for n in range(0,number_of_points):
            residuals.append(yvalues[n] - predicted_yvalues[n])

        ## calculate the residual sum of squares from the residuals, that is,
        ## square each residual and add them all up. we will try to minimize
        ## the residual sum of squares later.
        residual_sum_of_squares = 0.
        for r in residuals:
            residual_sum_of_squares += r**2
        print("RSS = %s" % residual_sum_of_squares)
        ##
        ##
        ##

        #now make a version of the residuals which is multiplied by the x-values
        residuals_times_xvalues = list()
        for n in range(0,number_of_points):
            residuals_times_xvalues.append(residuals[n] * xvalues[n])

        #now create the sums for the residuals and for the residuals times the x-values
        residuals_sum = sum(residuals)

        residuals_times_xvalues_sum = sum(residuals_times_xvalues)

        # now multiply the sums by a positive scalar and add each to m and b.

        residuals_sum *= 0.1
        residuals_times_xvalues_sum *= 0.1

        b += residuals_sum
        m += residuals_times_xvalues_sum

        #and repeat until convergence.
        #convergence occurs when ||sum vector|| < some tolerance.
        # ||sum vector|| = sqrt( residuals_sum**2 + residuals_times_xvalues_sum**2 )

        #check for convergence
        magnitude_of_sum_vector = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
        if magnitude_of_sum_vector < tolerance:
            break

    return (b, m)

Result:

gradient_descent_wtf([1,2,3,4,5,6,7,8,9,10],[6,23,8,56,3,24,234,76,59,567])
RSS = 370433.0
RSS = 300170125.7
RSS = 4.86943013045e+11
RSS = 7.90447409339e+14
RSS = 1.28312217794e+18
RSS = 2.08287421094e+21
RSS = 3.38110045417e+24
RSS = 5.48849288217e+27
RSS = 8.90939341376e+30
RSS = 1.44624932026e+34
Out[108]:
(-3.475524066284303e+16, -2.4195981188763203e+17)

The gradients are huge -- hence you are following large vectors for long distances (0.1 times a large number is large). Find unit vectors in the appropriate direction. Something like this (with comprehensions replacing your loops):

def gradient_descent_wtf(xvalues, yvalues):
    tolerance = 0.1

    m=1.
    b=1.

    for i in range(0,10):
        predicted_yvalues = [m*x+b for x in xvalues]

        residuals = [y-y_hat for y,y_hat in zip(yvalues,predicted_yvalues)]

        residual_sum_of_squares = sum(r**2 for r in residuals) #only needed for debugging purposes
        print("RSS = %s" % residual_sum_of_squares)

        residuals_times_xvalues = [r*x for r,x in zip(residuals,xvalues)]

        residuals_sum = sum(residuals)

        residuals_times_xvalues_sum = sum(residuals_times_xvalues)

        # (residuals_sum, residuals_times_xvalues_sum) is a vector which points in the negative
        # gradient direction. *Find a unit vector which points in the same direction*

        magnitude = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5

        residuals_sum /= magnitude
        residuals_times_xvalues_sum /= magnitude

        b += residuals_sum * (0.1)
        m += residuals_times_xvalues_sum * (0.1)

        #check for convergence -- this needs work!
        magnitude_of_sum_vector = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
        if magnitude_of_sum_vector < tolerance:
            break

    return (b, m)

For example:

>>> gradient_descent_wtf([1,2,3,4,5,6,7,8,9,10],[6,23,8,56,3,24,234,76,59,567])
RSS = 370433.0
RSS = 368732.1655050716
RSS = 367039.18363896786
RSS = 365354.0543519137
RSS = 363676.7775934381
RSS = 362007.3533123621
RSS = 360345.7814567845
RSS = 358692.061974069
RSS = 357046.1948108295
RSS = 355408.17991291644
(1.1157111313023558, 1.9932828425473605)

which is certainly much more plausible.
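
As a sanity check (not part of the original answer), the result can be compared with the closed-form least-squares fit, for example using numpy.polyfit:

import numpy as np

xvalues = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
yvalues = [6, 23, 8, 56, 3, 24, 234, 76, 59, 567]

# a degree-1 polyfit returns (slope, intercept) of the least-squares line
m_exact, b_exact = np.polyfit(xvalues, yvalues, 1)
print(m_exact, b_exact)   # roughly 37.6 and -100.9 for this data

This also shows that ten fixed-length steps of size 0.1 are still far from the minimum; the unit-vector version mainly demonstrates that the RSS now decreases instead of blowing up.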

It isn't a trivial matter to make a numerically stable gradient-descent algorithm. You might want to consult a decent textbook in numerical analysis.

First, your code is right.

But you should consider the math involved when you do linear regression.

For example, if the residual is -205.8 and your learning rate is 0.1, you will get a huge descent step of -20.58.

That step is so large that you can't get back to the correct m and b. You have to make your step small enough.

There are two ways to make the gradient descent step reasonable:

  1. Initialize with a small learning rate, such as 0.001 or 0.0003.
  2. Divide your step by the total amount of your input values (for example, average each gradient sum over the number of data points); a sketch is given after this list.
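
Here is a minimal sketch combining both ideas (the function name, learning rate and iteration count are illustrative choices, not from the original code):

def gradient_descent_scaled(xvalues, yvalues, learning_rate=0.001, iterations=100000):
    # same model as above: predict y with m*x + b
    m, b = 1.0, 1.0
    n = len(xvalues)
    for _ in range(iterations):
        residuals = [y - (m*x + b) for x, y in zip(xvalues, yvalues)]
        # average the gradient sums over the number of points so the
        # step does not grow with the size of the data set
        step_b = sum(residuals) / n
        step_m = sum(r * x for r, x in zip(residuals, xvalues)) / n
        b += learning_rate * step_b
        m += learning_rate * step_m
    return (b, m)

With a small enough learning rate this moves steadily toward the same line as the closed-form fit (roughly m ≈ 37.6 and b ≈ -100.9 for the data above), although it needs many iterations to get there.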
