
Using CUDA to solve a system of equations in non-linear least squares fashion

Using CUDA, I would like to solve a system of equations with a non-linear least squares solver. These methods are discussed in an excellent booklet that can be downloaded here.

The Jacobian matrix in my problem is sparse and lower triangular. Is there a library for CUDA available with these methods, or will I have to program these methods myself from the booklet?

Is a Gauss-Newton non-linear least squares solver, Levenberg-Marquardt or Powell's Method solver available in a CUDA library (either free or non-free)?

Before pointing out a possible, simple implementation of a quasi-Newton optimization routine in CUDA, a few words on how a quasi-Newton optimizer works.

Consider a function f of N real variables x and make a second-order expansion around a certain point x_i:

    f(x) ≈ f(x_i) + ∇f(x_i)^T (x - x_i) + (1/2) (x - x_i)^T A (x - x_i)

where A is the Hessian matrix.

To find a minimum starting from a point x_i, Newton's method consists of forcing

    ∇f(x) = ∇f(x_i) + A (x - x_i) = 0

which entails

    x_{i+1} = x_i - A^{-1} ∇f(x_i)

which, in turn, requires knowing the inverse of the Hessian. Furthermore, to ensure the function decreases, the update direction

    x_{i+1} - x_i = -A^{-1} ∇f(x_i)

should be such that

    ∇f(x_i)^T (x_{i+1} - x_i) < 0

which implies that

    ∇f(x_i)^T A^{-1} ∇f(x_i) > 0

According to the above inequality, the Hessian matrix should be positive definite. Unfortunately, the Hessian matrix is not necessarily positive definite, especially far from a minimum of f, so using the inverse of the Hessian, besides being computationally burdensome, can also be deleterious, pushing the procedure even farther from the minimum towards regions of increasing values of f. Generally speaking, it is more convenient to use a quasi-Newton method, i.e., an approximation of the inverse of the Hessian which stays positive definite and is updated iteration after iteration, converging to the inverse of the Hessian itself. A rough justification of the quasi-Newton method is the following. Consider

    ∇f(x) = ∇f(x_{i+1}) + A (x - x_{i+1})

and

    ∇f(x) = ∇f(x_i) + A (x - x_i)

Subtracting the two equations, we have the update rule for the Newton procedure

    x_{i+1} - x_i = A^{-1} [∇f(x_{i+1}) - ∇f(x_i)]

The updating rule for the quasi-Newton procedure is the following

    x_{i+1} - x_i = H_{i+1} [∇f(x_{i+1}) - ∇f(x_i)]

where H_{i+1} is the mentioned matrix approximating the inverse of the Hessian, updated step after step.

There are several rules for updating H_{i+1}, and I'm not going into the details of this point. A very common one is provided by the Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula, but in many cases the Polak-Ribière scheme is effective enough.
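For reference, denoting s_i = x_{i+1} - x_i and y_i = ∇f(x_{i+1}) - ∇f(x_i), the BFGS update of the inverse-Hessian approximation reads

    H_{i+1} = (I - s_i y_i^T / (y_i^T s_i)) H_i (I - y_i s_i^T / (y_i^T s_i)) + s_i s_i^T / (y_i^T s_i)

which preserves symmetry and, provided the curvature condition y_i^T s_i > 0 holds, positive definiteness.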

The CUDA implementation can follow the same steps as the classical Numerical Recipes approach, taking into account that:

1) Vector and matrix operations can be effectively accomplished by CUDA Thrust or cuBLAS;

2) The control logic can be performed by the CPU;

3) Line minimization, involving root bracketing and root finding, can be performed on the CPU, accelerating only the cost functional and gradient evaluations on the GPU.

By the above scheme, unknowns, gradients and Hessian can be kept on the device without any need to move them back and forth between host and device.
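As a minimal sketch of point 1), assuming device-resident unknowns and gradients, Thrust can evaluate a cost functional and its gradient entirely on the GPU. The quadratic functional below and all names in it are illustrative only, not part of any library API:

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/transform_reduce.h>
    #include <thrust/inner_product.h>
    #include <thrust/functional.h>
    #include <cstdio>

    // Illustrative residual: r_k = x_k - 1, so F(x) = sum_k (x_k - 1)^2.
    struct squared_residual {
        __host__ __device__ double operator()(double x) const {
            const double r = x - 1.0;
            return r * r;
        }
    };

    // Gradient of the illustrative functional: dF/dx_k = 2 (x_k - 1).
    struct gradient_component {
        __host__ __device__ double operator()(double x) const {
            return 2.0 * (x - 1.0);
        }
    };

    int main() {
        const int N = 1 << 20;
        thrust::device_vector<double> x(N, 0.5);   // unknowns live on the device
        thrust::device_vector<double> g(N);        // gradient, also device-resident

        // Cost functional, evaluated entirely on the GPU.
        const double F = thrust::transform_reduce(x.begin(), x.end(),
                                                  squared_residual(), 0.0,
                                                  thrust::plus<double>());

        // Gradient fill, again without any host/device traffic for the data.
        thrust::transform(x.begin(), x.end(), g.begin(), gradient_component());

        // For the steepest-descent direction d = -g, the descent condition
        // grad(f)^T d < 0 reduces to -<g, g> < 0.
        const double gg = thrust::inner_product(g.begin(), g.end(), g.begin(), 0.0);

        std::printf("F = %f, <g,g> = %f\n", F, gg);
        return 0;
    }

Only the scalars F and gg cross the PCIe bus; the CPU can then run the control logic and the line minimization on top of such evaluations.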

Please note also that some approaches are available in the literature in which attempts to parallelize the line minimization are also proposed; see

Y. Fei, G. Rong, B. Wang, W. Wang, "Parallel L-BFGS-B algorithm on GPU", Computers & Graphics, vol. 40, 2014, pp. 1-9.

At this github page, a full CUDA implementation is available, generalizing the Numerical Recipes approach employing linmin, mnbrak and dbrent to the GPU parallel case. That approach implements the Polak-Ribière scheme, but can be easily generalized to other quasi-Newton optimization problems.
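For reference, the Polak-Ribière scheme builds each new search direction from the gradients alone, with no inverse-Hessian approximation:

    β_{i+1} = ∇f(x_{i+1})^T [∇f(x_{i+1}) - ∇f(x_i)] / [∇f(x_i)^T ∇f(x_i)]

    d_{i+1} = -∇f(x_{i+1}) + β_{i+1} d_i

and each direction d_{i+1} is then explored by a line minimization such as linmin.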

Have a look also at libflame, which contains implementations of many operations provided by the BLAS and LAPACK libraries.

Nvidia released a function that can do exactly this, called csrlsvqr (part of the cuSOLVER library), which performs well on small matrices. Unfortunately, for large sparse matrices, results (in my experience) have been poor: it is not able to converge on a solution.
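For illustration, a minimal sketch of calling the double-precision variant, cusolverSpDcsrlsvqr, on a device-resident CSR system might look as follows; the wrapper name and the tolerance are mine, the matrix is assumed square (m x m), and all error checking is omitted:

    #include <cuda_runtime.h>
    #include <cusparse.h>
    #include <cusolverSp.h>

    // Minimal sketch: solve the square sparse system A x = b via QR.
    // A is an m x m CSR matrix with nnz non-zeros; all array arguments
    // are device pointers.
    void solveSparseQR(int m, int nnz,
                       const double *d_csrVal, const int *d_csrRowPtr,
                       const int *d_csrColInd, const double *d_b, double *d_x)
    {
        cusolverSpHandle_t handle;
        cusolverSpCreate(&handle);

        cusparseMatDescr_t descrA;
        cusparseCreateMatDescr(&descrA);
        cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL);
        cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO);

        const double tol = 1.e-12;   // illustrative tolerance
        const int reorder = 0;       // no reordering
        int singularity = 0;         // set to -1 on output if A is invertible under tol

        cusolverSpDcsrlsvqr(handle, m, nnz, descrA,
                            d_csrVal, d_csrRowPtr, d_csrColInd,
                            d_b, tol, reorder, d_x, &singularity);

        cusparseDestroyMatDescr(descrA);
        cusolverSpDestroy(handle);
    }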

To work around this convergence problem, I wrote my own tool, LSQR-CUDA.

There are no procedures currently available in any library that implement solving a system of equations with a non-linear least squares solver on the CUDA platform. These algorithms must be written from scratch, with help from some other libraries that implement linear algebra with sparse matrices. Also, as mentioned in the comment above, the cuBLAS library will help with linear algebra (a small cuBLAS sketch follows the links below).

https://developer.nvidia.com/cusparse

http://code.google.com/p/cusp-library/
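As a small illustration of the kind of building block cuBLAS supplies for such a from-scratch implementation, a damped update step x <- x - alpha * g reduces to a single axpy call; d_x and d_g are hypothetical device arrays:

    #include <cublas_v2.h>

    // Minimal sketch of one damped update x <- x - alpha * g with cuBLAS.
    // d_x and d_g are device arrays of length n; alpha is a host-side step
    // size (e.g., the result of a CPU-driven line search).
    void updateStep(cublasHandle_t handle, int n, double alpha,
                    const double *d_g, double *d_x)
    {
        const double neg_alpha = -alpha;
        // axpy: d_x = neg_alpha * d_g + d_x, computed entirely on the GPU
        cublasDaxpy(handle, n, &neg_alpha, d_g, 1, d_x, 1);
    }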

For those who are still looking for an answer, this one is for sparse matrices: OpenOF, "Framework for Sparse Non-linear Least Squares Optimization on a GPU".

It is to the GPU what g2o is to the CPU.
