Why do different Python minimization functions give different values?

Question

I'm trying to learn Python by rewriting the assignments from Andrew Ng's Machine Learning course, which were originally in Octave (I took the class and got the certificate). I'm having issues with the optimization functions. In the course they use fmincg, an Octave function that minimizes the (convex) cost function of linear regression given its derivative. They also teach gradient descent and the normal equation, which in theory should all give the same result (to within a few decimal places) when used correctly. They all work great for linear regression, and I get the same results in Python.

To be clear, I'm trying to minimize the cost function to find the best-fitting parameters (theta) for the data set. So far I've used 'nelder-mead', which doesn't need the derivative, and it gave me the solution closest to theirs. I've also tried 'TNC', 'CG' and 'BFGS', which all use the derivative to minimize the function. They all work great for a first-order (linear) polynomial, but when I raise the order of the polynomial, in my case to features x^1 through x^8, I can't get my function to fit the data set.

The exercise is really simple: I have 12 data points, so an 8th-order polynomial should capture every single point (if you're curious, it's an example of high variance, i.e. overfitting the data). The solution they show is a curve that goes through all the data points as expected. The best I got was with the 'nelder-mead' method, which only captured two points of the data set, while the other minimization methods gave me nothing close to what I'm looking for. I'm not sure what's wrong, because my cost function and gradient give the right values for the linear case (exactly Octave's answer), so I'm assuming they work fine.
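
One way to test the assumption that the cost and gradient agree beyond the linear case is to compare the analytic gradient with a finite-difference estimate at a random theta. The following is a minimal sketch using SciPy's `optimize.check_grad`; the `cost` and `grad` functions and the random data are stand-ins written for this illustration (with 1-D `theta` and `y`), not the code from the post.

```python
import numpy as np
from scipy import optimize

def cost(theta, X, y, lmda):
    # Regularized squared-error cost; theta[0] (bias) is not penalized.
    m = len(y)
    r = X @ theta - y
    return (r @ r) / (2 * m) + lmda / (2 * m) * np.sum(theta[1:] ** 2)

def grad(theta, X, y, lmda):
    # Analytic gradient of cost() with respect to theta.
    m = len(y)
    g = X.T @ (X @ theta - y) / m
    g[1:] += lmda / m * theta[1:]
    return g

# Hypothetical data: 12 samples, a bias column plus 8 features.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(12), rng.normal(size=(12, 8))])
y = rng.normal(size=12)
theta0 = rng.normal(size=9)

# check_grad returns the 2-norm of (analytic - finite-difference) gradient.
err = optimize.check_grad(cost, grad, theta0, X, y, 1.0)
print("gradient error:", err)
```

A small error at several random points (rather than only at theta = 0) suggests the two functions are consistent with each other; a large one points at the gradient or at array shapes.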

I'm going to list the functions in both Octave and Python in the hope that someone can explain why I'm getting different answers, or point out the obvious error that I'm not seeing.

function [J, grad] = linearRegCostFunction(X, y, theta, lambda)
%LINEARREGCOSTFUNCTION Compute cost and gradient for regularized linear 
%regression with multiple variables
%   [J, grad] = LINEARREGCOSTFUNCTION(X, y, theta, lambda) computes the 
%   cost of using theta as the parameter for linear regression to fit the 
%   data points in X and y. Returns the cost in J and the gradient in grad


m = length(y); % number of training examples 
J = 0;
grad = zeros(size(theta));

htheta = X * theta;
n = length(theta);
J = 1 / (2 * m) * sum((htheta - y) .^ 2) + lambda / (2 * m) * sum(theta(2:n) .^ 2);

grad = 1 / m * X' * (htheta - y);
grad(2:n) = grad(2:n) + lambda / m * theta(2:n); % the bias term theta(1) is not regularized
grad = grad(:);

end
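
For comparison, the Octave function above can be ported to NumPy as a single function that returns both the cost and the gradient, which `scipy.optimize.minimize` accepts when called with `jac=True`. This is a sketch rather than the course's code: the name `linear_reg_cost_function` and the tiny data set are made up for the example, and `theta` and `y` are assumed to be 1-D arrays.

```python
import numpy as np
from scipy import optimize

def linear_reg_cost_function(theta, X, y, lmda):
    # Returns (J, grad) like the Octave version; theta[0] is the unregularized bias.
    m = len(y)
    residual = X @ theta - y
    J = residual @ residual / (2 * m) + lmda / (2 * m) * np.sum(theta[1:] ** 2)
    grad = X.T @ residual / m
    grad[1:] += lmda / m * theta[1:]
    return J, grad

# Made-up data lying exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

# jac=True tells minimize that the objective returns (cost, gradient) together.
res = optimize.minimize(linear_reg_cost_function, np.zeros(2),
                        args=(X, y, 0.0), jac=True, method='BFGS')
print(res.x)
```

With this data the fitted parameters should come out close to 1 and 2. Keeping everything 1-D avoids the reshaping between `(n, 1)` columns and flat arrays that the optimizer interface otherwise requires.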

Here are snippets of my code; if anyone would like the full code, I can provide that as well:

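import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize

def costFunction(theta, Xcost, y, lmda):
    # Regularized squared-error cost; the bias term theta[0] is not penalized.
    m = len(y)
    theta = theta.reshape((-1, 1))
    # Force y into a column so the subtraction cannot broadcast to (m, m).
    htheta = np.dot(Xcost, theta) - y.reshape((-1, 1))
    J = (1.0 / (2 * m)) * np.sum(htheta ** 2) + (lmda / (2.0 * m)) * np.sum(theta[1:, :] ** 2)
    return J  # a plain scalar rather than a (1, 1) array

def gradCostFunc(gradtheta, X, y, lmda):
    # Gradient of costFunction with respect to theta, returned as a flat (n,) array.
    m = len(y)
    gradtheta = gradtheta.reshape((-1, 1))
    hgradtheta = np.dot(X, gradtheta) - y.reshape((-1, 1))

    grad = (1.0 / m) * np.dot(X.T, hgradtheta)
    # Regularize every component except the bias term.
    grad[1:, 0] = grad[1:, 0] + (lmda / float(m)) * gradtheta[1:, 0]
    return grad.ravel()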

def normalEqn(X, y, lmda):
    e = np.eye(X.shape[1])
    e[0,0] = 0
    theta = np.dot(np.linalg.pinv(np.dot(X.T,X) + lmda * e),np.dot(X.T,y))
    return theta 

def gradientDescent(X, y, theta, alpha, lmda, num_iters):
    # Batch gradient descent; returns the final theta as a flat (n,) array.
    m = len(y)
    # Flatten theta so that subtracting the (n,) gradient cannot broadcast to (n, n).
    theta = np.asarray(theta, dtype=float).ravel()
    # J_history tracks the evolution of the cost function
    J_history = np.zeros((num_iters, 1))

    for i in range(num_iters):
        grad = gradCostFunc(theta, X, y, lmda)
        # updating the thetas
        theta = theta - alpha * grad
        J_history[i] = costFunction(theta, X, y, lmda)

    plt.plot(J_history)
    plt.show()

    return theta

def trainLR(initheta, X, y, lmda):
    # Minimize the regularized cost starting from initheta; returns the fitted theta.
    options = {'maxiter': 1000}
    # minimize expects a 1-D starting point, so flatten the (n, 1) column.
    x0 = np.ravel(initheta)
    res = optimize.minimize(costFunction, x0, jac=gradCostFunc, method='CG',
                            args=(X, y, lmda), options=options)
    # Alternatives tried:
    #res = optimize.minimize(costFunction, x0, method='nelder-mead',
    #                        args=(X, y, lmda), options={'disp': False})
    #xopt = optimize.fmin_bfgs(costFunction, x0, fprime=gradCostFunc, args=(X, y, lmda))
    return res.x

def polyFeatures(X, degree):
    # map the higher polynomials 
    out = X 
    if degree >= 2:
        for i in range(2,degree+1):
            out = np.column_stack((out,X**i))
        return out 
    else:
        return out

def featureNormalize(X):
    # Since the values will vary by orders of magnitudes 
    # It’s important to normalize the various features 
    mu = np.mean(X, axis=0)
    S1 = np.std(X, axis=0)
    return mu, S1, (X - mu)/S1
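
One detail worth checking when comparing normalized features against Octave: `np.std` divides by N by default (`ddof=0`), whereas Octave's `std` divides by N - 1. Below is a small sketch of the difference using made-up numbers; whether the course's own normalization relies on `std` in this way is an assumption here.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # made-up feature column

pop_std = np.std(x)             # divides by N (NumPy default, ddof=0)
sample_std = np.std(x, ddof=1)  # divides by N - 1 (Octave's default for std)

print(pop_std, sample_std)
```

The two differ by a factor of sqrt(N / (N - 1)), roughly 4% for 12 points. With lambda = 0 that rescaling changes the fitted theta but not the fitted curve, so it can explain small numeric mismatches with Octave but not a fit that misses the data.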

And here is the main call for these functions:

X, y, Xval, yval, Xtest, ytest = loadData('ex5data1.mat')
X_poly = X # to be used later on in the program
p = 8 
X_poly = polyFeatures(X_poly, p)
mu, sigma, X_poly = featureNormalize(X_poly)
X_poly = padding(X_poly)
theta = np.zeros((X_poly.shape[1],1))
theta = trainLR(theta, X_poly, y, 0.)
#theta = normalEqn(X_poly, y, 0.)
#theta = gradientDescent(X_poly, y, theta, 0.1, 0, 1500)
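
Because the cost is quadratic in theta, a direct least-squares solve gives the lowest cost any optimizer could reach when lambda is 0, so it can serve as a yardstick for how far a `minimize` run is from convergence. The sketch below uses synthetic data in place of `ex5data1.mat`, and the helper names are invented for the illustration.

```python
import numpy as np
from scipy import optimize

def cost(theta, X, y):
    # Unregularized squared-error cost (lambda = 0).
    r = X @ theta - y
    return r @ r / (2 * len(y))

def grad(theta, X, y):
    # Gradient of cost() with respect to theta.
    return X.T @ (X @ theta - y) / len(y)

# Synthetic stand-in for the exercise data: 12 points, degree-8 features.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-50, 40, size=12))
y = 0.01 * x ** 2 + rng.normal(scale=2.0, size=12)

P = np.column_stack([x ** k for k in range(1, 9)])
P = (P - P.mean(axis=0)) / P.std(axis=0, ddof=1)   # normalize each feature
X = np.column_stack([np.ones(len(x)), P])          # prepend the bias column

# Closed-form least-squares solution: the lowest cost attainable.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
J_best = cost(theta_ls, X, y)

final_costs = {}
for method in ("CG", "BFGS"):
    res = optimize.minimize(cost, np.zeros(X.shape[1]), jac=grad, args=(X, y),
                            method=method, options={"maxiter": 1000})
    final_costs[method] = res.fun
    print(method, "cost:", res.fun, "closed-form cost:", J_best, "iterations:", res.nit)
```

If an optimizer stops at a cost well above the closed-form value, it has not converged (for instance because the iteration limit was hit or the problem is badly conditioned), which is a different issue from a wrong cost or gradient.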

Answer 1

My answer is probably off point, since your question asked for help debugging your current implementation.

That said, if you're interested in using ready-made optimisers in Python, have a look at OpenOpt. The library contains reasonably performant implementations of optimisers for a wide variety of optimisation problems.

I should also mention that the scikit-learn library provides a nice machine learning toolset for Python.
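
As an illustration of the scikit-learn route, a degree-8 polynomial fit like the one in the question can be sketched with a pipeline. This assumes scikit-learn is installed and uses synthetic data rather than the course data set; `LinearRegression` fits ordinary least squares, which corresponds to lambda = 0 in the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the 12 training points.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-50, 40, size=12)).reshape(-1, 1)
y = 0.01 * x.ravel() ** 2 + rng.normal(scale=2.0, size=12)

# Expand to x^1..x^8, normalize each feature, then fit by least squares.
model = make_pipeline(
    PolynomialFeatures(degree=8, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
model.fit(x, y)
r2 = model.score(x, y)
print("training R^2:", r2)
```

The pipeline applies the same feature expansion and scaling at prediction time, so `model.predict` can be called directly on raw x values.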

Answer 1 (accepted), 0 votes, 2013-12-20 21:25:43
