Improving Matlab + CUSP MEX solution for A*x=B on CUDA GPU

Question

Matlab still can't compute sparse matrices on CUDA GPU. There are no such toolboxes (Jacket is discontinued) for that as well. That's why I am using CUSP integrated to Matlab through MEX file. However, my developed tool has two problems:

It is VERY unstable for big equation systems (actually beginning from only 100 elements),
It is tens or hundreds times slower than Matlab CPU alternative.

I'm solving A*x=b, where A is a sparse, symmetric matrix, b is a vector.

Hardware specs: Intel i7 3630QM, GT640M 2G, 8 GB DDR3. Software: Windows 8 64 bit, Matlab R2012b 64 bit, CUDA 5.0 64 bit, CUSP 0.3.1, Windows SDK v7.0, VS2010 compiler.

MEX code:

#include<cusp/csr_matrix.h>
#include <cusp/krylov/bicgstab.h>
#include <matrix.h>
#include <mex.h> 
#include <time.h>

void mexFunction(int nlhs,mxArray *plhs[],int nrhs,const mxArray *prhs[])
{
        double t1 =  clock();
          // data from Matlab       
        double *b = mxGetPr(prhs[1]);
        double *A = mxGetPr(prhs[0]);
        int n = mxGetM(prhs[0]);
        mwIndex *ir = mxGetIr(prhs[0]);
        mwIndex *jc = mxGetJc(prhs[0]);
        int N = jc[n];
        t1 = clock() - t1;

        double t2 =  clock();
          // initialization of matrix A in CSR format (jc and ir are exchanged, because Matlab uses CSC format
        cusp::csr_matrix<int,float,cusp::device_memory> Ag(n,n,3*n-2);
        thrust::copy(jc, jc + n + 1, Ag.row_offsets.begin());
        thrust::copy(ir, ir + N,     Ag.column_indices.begin());
        thrust::copy(A,  A  + N,     Ag.values.begin()); 
          // initialization of vector b
        cusp::array1d<float, cusp::device_memory> bg (b, b+n);
        cusp::array1d<float, cusp::device_memory> xg (n, 0);
        t2 = clock() - t2;

        double t3 =  clock();
          // bicgstab algorithm solution for vector x, when using 0.001 accuracy and precondition M
          // this is the slowest part, much slower than others
        cusp::verbose_monitor<float> monitor(bg, 5000, 1e-3);
        cusp::identity_operator<float, cusp::device_memory> M(n, n);
        cusp::krylov::bicgstab(Ag, xg, bg, monitor, M);        
        t3 = clock() - t3;

        double t4 =  clock();     
          // gathering solution vector bact on host to Matlab array T
        mxArray *T = mxCreateDoubleMatrix(n, 1, mxREAL);
        double *x  = mxGetPr(T);
        thrust::copy(xg.begin(), xg.end(), x);
        t4 = clock() - t4;

          // gathering execution times to Matlab array times
        mxArray *times=mxCreateDoubleMatrix(5, 1, mxREAL);
        double *timesb=mxGetPr(times);
        timesb[0]=t1; timesb[1]=t2; timesb[2]=t3; timesb[3]=t4; timesb[4]=monitor.iteration_count();

          // sending data back to Matlab
        plhs[0] = times; 
        plhs[1] = T;
}

Compile this code in MEX file (ex.cu) on Matlab using these commands (change second command for 32 bit if necessary):

>> !nvcc -c -arch sm_20 ex.cu -Xcompiler -fPIC -I "C:\Program Files\MATLAB\R2012b\extern\include" -I "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\include
>> mex ex.obj -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\lib\x64" -lcudart

Sample matrices, vectors and compiled 64 bit MEX function: http://www.failai.lt/3fqkhvoslxyt/sampleData.7z.htm

Use:

tic; [times,x]=ex(K',F); toc;   %K has to be transposed for CSR

where times - separate execution times, where last element - count of iterations (bicgstab monitor) used for a solution, result - the solution of K*x=F.

Results ( http://www.failai.lt/rupaliln7kfb/results.7z.htm ):

K_int_6, F_int_6 - ok
K_11, F_11 - x(1) wrong, others ok
K_100000, F_100000 - x(1) wrong, others from beginning are ok but later are decreasing comparing to correct result.
K_100000, F_100000 - execution lasts 0.6 s on GPU (MEX) while 0.014 s on CPU ( tic;xcpu=K\\F;toc; ).

Could you look at that code, maybe try the MEX function, report about your results, suggest how to improve the function? Maybe you know any alternatives which enables sparce computations on GPU? I hope, it will be useful for everyone until Matlab releases its compatibility for sparse matrices on GPU :)

Answer 1

take a look at Matlab file exchange, cusp sparse class for gpus, support for single precision, real/complex: http://www.mathworks.com/matlabcentral/fileexchange/44423-gpu-sparse-accumarray-non-uniform-grid

sparse matrix vector multiply is overloaded with CUSP.

Improving Matlab + CUSP MEX solution for A*x=B on CUDA GPU

Question

1 answers

solution1
0 2013-12-02 19:08:04

Improving Matlab + CUSP MEX solution for A*x=B on CUDA GPU

Question

1 answers

solution1 0 2013-12-02 19:08:04

solution1
0 2013-12-02 19:08:04