
Logistic regression python solvers' definitions

I am using the logistic regression function from sklearn, and was wondering what each of the solvers is actually doing behind the scenes to solve the optimization problem.

Can someone briefly describe what "newton-cg", "sag", "lbfgs" and "liblinear" are doing?

Well, I hope I'm not too late to the party! Let me first try to establish some intuition before digging into loads of information (warning: this is not a brief comparison).


Introduction

A hypothesis, h(x), takes an input and gives us the estimated output value.

This hypothesis can be as simple as a one-variable linear equation, or as complicated as a long multivariate equation, depending on the type of algorithm we're using (i.e. linear regression, logistic regression, etc.).

(figure: the hypothesis h(x))
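To make the hypothesis concrete for logistic regression specifically, here is a minimal sketch (the names theta and x are just illustrative): the hypothesis is the sigmoid of a linear combination of the features.

    import numpy as np

    def sigmoid(z):
        # squashes any real number into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def hypothesis(theta, x):
        # logistic regression hypothesis: h(x) = sigmoid(theta . x)
        return sigmoid(np.dot(theta, x))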

Our task is to find the best Parameters (aka Thetas or Weights) that give us the least error in predicting the output. We call this error a Cost or Loss Function, and obviously our goal is to minimize it in order to get the best predicted output!

One more thing to recall: the relation between a parameter value and its effect on the cost function (i.e. the error) looks like a bowl-shaped curve (i.e. quadratic; recall this because it's very important).
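As a concrete sketch of such a cost function, logistic regression typically minimizes the log-loss (regularization omitted here for simplicity):

    import numpy as np

    def log_loss(theta, X, y):
        # X: (n_samples, n_features), y: labels in {0, 1}
        h = 1.0 / (1.0 + np.exp(-X.dot(theta)))   # predicted probabilities
        # average cross-entropy between the true labels and the predictions
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))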

So if we start at any point on that curve and keep taking the derivative (i.e. the tangent line) at each point we stop at, we will end up at what is called the Global Optimum, as shown in this image:

(figure: the bowl-shaped cost curve J(w))

If we take the partial derivative at the minimum-cost point (i.e. the global optimum), we find the slope of the tangent line = 0 (then we know that we reached our target).

That's valid only if we have a Convex Cost Function, but if we don't, we may end up stuck at what is called a Local Optimum; consider this non-convex function:

(figure: a non-convex function)

Now you should have an intuition about the relationship between what we are doing and the terms: Derivative, Tangent Line, Cost Function, Hypothesis, etc.

Side Note: The above-mentioned intuition is also related to the Gradient Descent Algorithm (see later).
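To make that intuition concrete, here is a minimal gradient-descent sketch (a toy illustration, not what sklearn runs internally): we repeatedly step against the gradient until we (hopefully) land at the optimum.

    import numpy as np

    def gradient_descent(grad, theta0, learning_rate=0.1, n_iters=100):
        # grad(theta): returns the gradient of the cost function at theta
        theta = np.array(theta0, dtype=float)
        for _ in range(n_iters):
            theta -= learning_rate * grad(theta)   # step downhill
        return theta

    # toy usage: minimize f(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3)
    print(gradient_descent(lambda t: 2 * (t - 3), theta0=[0.0]))   # -> approx. [3.]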


Background

Linear Approximation:

Given a function f(x), we can find its tangent at x=a. The equation of the tangent line L(x) is: L(x) = f(a) + f′(a)(x − a).

Take a look at the following graph of a function and its tangent line:

(figure: a function and its tangent line at x=a)

From this graph we can see that near x=a, the tangent line and the function have nearly the same graph. On occasion we will use the tangent line, L(x), as an approximation to the function, f(x), near x=a. In these cases we call the tangent line the linear approximation to the function at x=a.
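A tiny numeric check of this idea, using f(x) = ln(x) around a = 1 as an arbitrary example:

    import numpy as np

    a = 1.0
    f = np.log
    f_prime = lambda x: 1.0 / x

    def L(x):
        # tangent-line (linear) approximation of f around a
        return f(a) + f_prime(a) * (x - a)

    print(f(1.1), L(1.1))   # ~0.0953 vs 0.1 -> very close near a
    print(f(2.0), L(2.0))   # ~0.6931 vs 1.0 -> poor far away from a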

Quadratic Approximation:

Same idea as the linear approximation, but this time we are dealing with a curve, and we cannot approximate it well near the point of interest by using the tangent line alone.

Instead, we use a parabola (a curve where any point is at an equal distance from a fixed point and a fixed straight line), like this:

(figure: a quadratic function / parabola)

And in order to fit a good parabola, both the parabola and the original function should have the same value, the same first derivative, AND the same second derivative at x=a; the formula will be (just out of curiosity): Q_a(x) = f(a) + f′(a)(x − a) + f″(a)(x − a)²/2
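Continuing the same toy example, the quadratic approximation tracks the function noticeably better than the tangent line does:

    import numpy as np

    a = 1.0
    f = np.log
    f_prime = lambda x: 1.0 / x
    f_second = lambda x: -1.0 / x**2

    def Q(x):
        # quadratic approximation: matches value, slope AND curvature at a
        return f(a) + f_prime(a) * (x - a) + 0.5 * f_second(a) * (x - a) ** 2

    print(f(1.5), Q(1.5))   # ~0.4055 vs 0.375, versus 0.5 from the linear approximation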

Now we should be ready to do the comparison in detail.


Comparison between the methods

1. Newton's Method

Recall the motivation for the gradient descent step at x: we minimize the quadratic function (i.e. the Cost Function).

Newton's method uses, in a sense, a better quadratic function minimisation. Better, because it uses the quadratic approximation (i.e. first AND second partial derivatives).

You can imagine it as a twisted Gradient Descent with the Hessian (the Hessian is a square matrix of second-order partial derivatives, of order nxn).

Moreover, the geometric interpretation of Newton's method is that at each iteration one approximates f(x) by a quadratic function around x_n, and then takes a step towards the maximum/minimum of that quadratic function (in higher dimensions, this may also be a saddle point). Note that if f(x) happens to be a quadratic function, then the exact extremum is found in one step.
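A minimal 1-D sketch of that iteration (the real solvers work with the full gradient and Hessian matrix, but the idea is the same): each step jumps to the extremum of the local quadratic approximation.

    def newton_minimize(f_prime, f_second, x0, n_iters=10):
        # 1-D Newton step for optimization: x_{n+1} = x_n - f'(x_n) / f''(x_n)
        x = x0
        for _ in range(n_iters):
            x = x - f_prime(x) / f_second(x)
        return x

    # toy usage: for a purely quadratic f(x) = (x - 3)^2, the exact minimum is found in one step
    print(newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=10.0, n_iters=1))   # -> 3.0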

Drawbacks:

  1. It's computationally expensive because of the Hessian Matrix (i.e. the second partial derivative calculations).

  2. It is attracted to Saddle Points, which are common in multivariable optimization (i.e. a point where the partial derivatives disagree over whether this input should be a maximum or a minimum point!).

2. Limited-memory Broyden–Fletcher–Goldfarb–Shanno Algorithm:

In a nutshell, it is an analogue of Newton's Method, but here the Hessian matrix is approximated using updates specified by gradient evaluations (or approximate gradient evaluations). In other words, it uses an estimation of the inverse Hessian matrix.

The term Limited-memory simply means it stores only a few vectors that represent the approximation implicitly.

If I dare say, when the dataset is small, L-BFGS performs relatively the best compared to the other methods, especially as it saves a lot of memory. However, there are some “serious” drawbacks: if it is unsafeguarded, it may not converge to anything.

Side note: This solver has been the default solver in sklearn LogisticRegression since version 0.22, replacing LIBLINEAR.
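In sklearn you pick it via the solver argument; a minimal usage sketch on toy data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # 'lbfgs' has been the default since version 0.22; written out explicitly here
    clf = LogisticRegression(solver='lbfgs', max_iter=1000)
    clf.fit(X, y)
    print(clf.score(X, y))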

3. A Library for Large Linear Classification:

It's a linear classification library that supports logistic regression and linear support vector machines (a linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics, i.e. the feature values).

The solver uses a coordinate descent (CD) algorithm that solves optimization problems by successively performing approximate minimization along coordinate directions or coordinate hyperplanes.

LIBLINEAR is the winner of the ICML 2008 large-scale learning challenge. It applies automatic parameter selection (aka L1 Regularization) and it's recommended when you have a high-dimensional dataset (recommended for solving large-scale classification problems).

Drawbacks:

  1. It may get stuck at a non-stationary point (i.e. a non-optimum) if the level curves of the function are not smooth.

  2. It also cannot run in parallel.

  3. It cannot learn a true multinomial (multiclass) model; instead, the optimization problem is decomposed in a “one-vs-rest” fashion, so separate binary classifiers are trained for all classes.

Side note: According to the Scikit Documentation: the “liblinear” solver was the one used by default for historical reasons before version 0.22. Since then, the default is the Limited-memory Broyden–Fletcher–Goldfarb–Shanno Algorithm.
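A minimal usage sketch; with this solver the multiclass case is handled one-vs-rest, and both 'l1' and 'l2' penalties are available:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # liblinear supports penalty='l1' or 'l2', but not the true multinomial loss
    clf = LogisticRegression(solver='liblinear', penalty='l1', C=1.0)
    clf.fit(X, y)
    print(clf.score(X, y))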

4. Stochastic Average Gradient:

The SAG method optimizes the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods.

It is faster than other solvers for large datasets, when both the number of samples and the number of features are large.

Drawbacks:

  1. It only supports L2 penalization.

  2. Its memory cost is O(N), which can make it impractical for large N (because it remembers the most recently computed values for approximately all gradients).
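A greatly simplified sketch of the idea (not sklearn's implementation): keep a table with one stored gradient per sample, step along the average of that table, and refresh only one randomly chosen entry per iteration. That table is exactly the O(N) memory mentioned above.

    import numpy as np

    def sag_sketch(grad_i, n_samples, theta0, lr=0.05, n_iters=3000, seed=0):
        # grad_i(theta, i) returns the gradient of the loss on sample i only
        rng = np.random.default_rng(seed)
        theta = np.array(theta0, dtype=float)
        memory = np.zeros((n_samples, theta.size))    # one stored gradient per sample: the O(N) memory
        for _ in range(n_iters):
            i = rng.integers(n_samples)
            memory[i] = grad_i(theta, i)              # refresh a single entry
            theta -= lr * memory.mean(axis=0)         # step along the average of the stored gradients
        return theta

    # toy usage: one-parameter least squares, true value 2
    X = np.linspace(0.1, 1.0, 50)
    y = 2 * X
    g = lambda th, i: np.array([2.0 * (th[0] * X[i] - y[i]) * X[i]])
    print(sag_sketch(g, n_samples=50, theta0=[0.0]))   # -> approx. [2.]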

5. SAGA:

The SAGA solver is a variant of SAG that also supports the non-smooth penalty='l1' option (i.e. L1 Regularization). This is therefore the solver of choice for sparse multinomial logistic regression, and it's also suitable for very large datasets.

Side note: According to the Scikit Documentation: the SAGA solver is often the best choice.
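A minimal usage sketch with an L1 penalty (which lbfgs, newton-cg and sag do not support):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=50, n_informative=10, random_state=0)

    # saga handles the non-smooth L1 penalty (and the true multinomial loss) on large datasets
    clf = LogisticRegression(solver='saga', penalty='l1', C=1.0, max_iter=5000)
    clf.fit(X, y)
    print((clf.coef_ != 0).sum(), "non-zero coefficients out of", clf.coef_.size)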


Summary

The following table is taken from the Scikit Documentation:

(table: solver comparison from the scikit-learn documentation)
