Improving a Matlab + CUSP MEX solution for A*x = b on a CUDA GPU

Question

Matlab still cannot operate on sparse matrices on a CUDA GPU, and there is no toolbox for that either (Jacket is discontinued). That is why I am using CUSP, integrated into Matlab through a MEX file. However, the tool I developed has two problems:

It is VERY unstable for big equation systems (actually beginning from only 100 elements),
It is tens or hundreds of times slower than the Matlab CPU alternative.

I'm solving A*x=b, where A is a sparse, symmetric matrix and b is a vector.

Hardware specs: Intel i7 3630QM, GT640M 2G, 8 GB DDR3. Software: Windows 8 64 bit, Matlab R2012b 64 bit, CUDA 5.0 64 bit, CUSP 0.3.1, Windows SDK v7.0, VS2010 compiler.

MEX code:

#include <cusp/array1d.h>
#include <cusp/csr_matrix.h>
#include <cusp/linear_operator.h>
#include <cusp/monitor.h>
#include <cusp/krylov/bicgstab.h>
#include <thrust/copy.h>
#include <matrix.h>
#include <mex.h>
#include <time.h>

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
        double t1 = clock();
          // data from Matlab: prhs[0] is the sparse matrix, prhs[1] is the right-hand side
        double *b = mxGetPr(prhs[1]);
        double *A = mxGetPr(prhs[0]);
        int n = mxGetM(prhs[0]);
        mwIndex *ir = mxGetIr(prhs[0]);
        mwIndex *jc = mxGetJc(prhs[0]);
        int N = jc[n];   // number of stored nonzeros
        t1 = clock() - t1;

        double t2 = clock();
          // initialization of matrix A in CSR format (jc and ir are exchanged, because Matlab uses CSC format)
          // storage is allocated for all N stored nonzeros
        cusp::csr_matrix<int,float,cusp::device_memory> Ag(n, n, N);
        thrust::copy(jc, jc + n + 1, Ag.row_offsets.begin());
        thrust::copy(ir, ir + N,     Ag.column_indices.begin());
        thrust::copy(A,  A  + N,     Ag.values.begin());
          // initialization of vector b and of the initial guess x = 0
        cusp::array1d<float, cusp::device_memory> bg (b, b+n);
        cusp::array1d<float, cusp::device_memory> xg (n, 0);
        t2 = clock() - t2;

        double t3 = clock();
          // bicgstab solution for vector x, using a 0.001 tolerance, at most 5000 iterations and preconditioner M (identity)
          // this is the slowest part, much slower than the others
        cusp::verbose_monitor<float> monitor(bg, 5000, 1e-3);
        cusp::identity_operator<float, cusp::device_memory> M(n, n);
        cusp::krylov::bicgstab(Ag, xg, bg, monitor, M);
        t3 = clock() - t3;

        double t4 = clock();
          // gathering the solution vector back on the host into Matlab array T
        mxArray *T = mxCreateDoubleMatrix(n, 1, mxREAL);
        double *x  = mxGetPr(T);
        thrust::copy(xg.begin(), xg.end(), x);
        t4 = clock() - t4;

          // gathering execution times (in clock ticks) and the iteration count into Matlab array times
        mxArray *times = mxCreateDoubleMatrix(5, 1, mxREAL);
        double *timesb = mxGetPr(times);
        timesb[0]=t1; timesb[1]=t2; timesb[2]=t3; timesb[3]=t4; timesb[4]=monitor.iteration_count();

          // sending data back to Matlab
        plhs[0] = times;
        plhs[1] = T;
}
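
The comment about exchanging jc and ir relies on a property of the two storage formats: the CSC arrays of a matrix, read unchanged, are the CSR arrays of its transpose. The following is a minimal host-only C++ sketch of that reinterpretation. The Csr struct and both function names are hypothetical helpers made up for illustration, not CUSP or Matlab API. It takes the nonzero count from jc[n], the last column pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side CSR container for illustration (not a CUSP type).
struct Csr {
    std::size_t num_rows;
    std::size_t num_cols;
    std::vector<int> row_offsets;     // size num_rows + 1
    std::vector<int> column_indices;  // size nnz
    std::vector<float> values;        // size nnz
};

// Reads the CSC arrays of an m-by-n matrix (jc: column pointers, ir: row
// indices, pr: values, the layout Matlab uses for sparse matrices) as the
// CSR arrays of its n-by-m transpose. No entries are reordered.
Csr csc_as_csr_of_transpose(std::size_t m, std::size_t n,
                            const std::size_t *jc, const std::size_t *ir,
                            const double *pr)
{
    const std::size_t nnz = jc[n];  // last column pointer = nonzero count
    Csr t;
    t.num_rows = n;
    t.num_cols = m;
    t.row_offsets.resize(n + 1);
    t.column_indices.resize(nnz);
    t.values.resize(nnz);
    for (std::size_t i = 0; i <= n; ++i)
        t.row_offsets[i] = static_cast<int>(jc[i]);
    for (std::size_t k = 0; k < nnz; ++k) {
        t.column_indices[k] = static_cast<int>(ir[k]);
        t.values[k] = static_cast<float>(pr[k]);
    }
    return t;
}

// y = M * x for a CSR matrix, used here only to check the reinterpretation.
std::vector<float> csr_multiply(const Csr &M, const std::vector<float> &x)
{
    std::vector<float> y(M.num_rows, 0.0f);
    for (std::size_t r = 0; r < M.num_rows; ++r)
        for (int k = M.row_offsets[r]; k < M.row_offsets[r + 1]; ++k)
            y[r] += M.values[k] * x[M.column_indices[k]];
    return y;
}
```

Passing K' from Matlab therefore hands over CSC arrays that, read as CSR, describe K itself. For a symmetric K the transpose changes nothing numerically; for a nonsymmetric matrix it is what makes the CSR reading correct.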

Compile this code (saved as ex.cu) into a MEX file from Matlab using these commands (change the second command for 32 bit if necessary):

>> !nvcc -c -arch sm_20 ex.cu -Xcompiler -fPIC -I "C:\Program Files\MATLAB\R2012b\extern\include" -I "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\include"
>> mex ex.obj -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\lib\x64" -lcudart

Sample matrices, vectors and the compiled 64 bit MEX function: http://www.failai.lt/3fqkhvoslxyt/sampleData.7z.htm

Usage:

tic; [times,x]=ex(K',F); toc;   %K has to be transposed for CSR

where times holds the separate execution times, with its last element being the number of iterations (from the bicgstab monitor) used for the solution, and x is the solution of K*x=F.
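
The first four entries of times are differences of clock() values, so they are in clock ticks rather than seconds. A minimal sketch of the conversion, assuming only the standard C macro CLOCKS_PER_SEC; the helper name ticks_to_seconds is hypothetical and not part of the MEX code:

```cpp
#include <cassert>
#include <ctime>

// Converts a tick count, such as a difference of two clock() values,
// to seconds. Hypothetical helper for illustration.
double ticks_to_seconds(double ticks)
{
    return ticks / static_cast<double>(CLOCKS_PER_SEC);
}
```

In the MEX function this could be applied to t1 through t4 before they are stored in timesb, so that the values are directly comparable with what tic/toc reports.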

Results ( http://www.failai.lt/rupaliln7kfb/results.7z.htm ):

K_int_6, F_int_6 - ok
K_11, F_11 - x(1) is wrong, the others are ok
K_100000, F_100000 - x(1) is wrong; the other elements are ok at the beginning but later fall below the correct result.
K_100000, F_100000 - execution takes 0.6 s on the GPU (MEX) versus 0.014 s on the CPU ( tic;xcpu=K\F;toc; ).
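
One way to judge results like these independently of the direct solve K\F is to compute the relative residual ||b - A*x|| / ||b|| on the host in double precision. The 1e-3 passed to the CUSP monitor appears to be a relative residual tolerance (an assumption worth checking against the CUSP documentation), so a single-precision iterative solution that meets it need not agree with a double-precision direct solve beyond a few digits. The sketch below is a host-only illustration; the function name and the use of std::vector are assumptions, not part of the MEX code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Relative residual ||b - A*x||_2 / ||b||_2 for a square CSR matrix A,
// accumulated in double precision. Assumes b is not the zero vector.
// Hypothetical helper for illustration.
double relative_residual(const std::vector<int> &row_offsets,
                         const std::vector<int> &column_indices,
                         const std::vector<double> &values,
                         const std::vector<double> &x,
                         const std::vector<double> &b)
{
    double r_norm2 = 0.0;  // squared norm of the residual
    double b_norm2 = 0.0;  // squared norm of the right-hand side
    for (std::size_t row = 0; row < b.size(); ++row) {
        double ax = 0.0;   // (A*x)[row]
        for (int k = row_offsets[row]; k < row_offsets[row + 1]; ++k)
            ax += values[k] * x[column_indices[k]];
        const double r = b[row] - ax;
        r_norm2 += r * r;
        b_norm2 += b[row] * b[row];
    }
    return std::sqrt(r_norm2) / std::sqrt(b_norm2);
}
```

A value well above 1e-3 for the returned x would point to a problem in the data transfer or the solver rather than to ordinary single-precision rounding.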

Could you look at the code, maybe try the MEX function, report your results, and suggest how to improve the function? Maybe you know of alternatives that enable sparse computations on the GPU? I hope it will be useful for everyone until Matlab releases support for sparse matrices on the GPU :)

Answer 1

Take a look at the Matlab File Exchange: a CUSP-based sparse class for GPUs, with support for single precision and real/complex values: http://www.mathworks.com/matlabcentral/fileexchange/44423-gpu-sparse-accumarray-non-uniform-grid

Sparse matrix-vector multiplication is overloaded with CUSP.

Answer 1 posted: 2013-12-02 19:08:04