Improving Matlab + CUSP MEX solution for A*x=B on CUDA GPU
Matlab still can't compute with sparse matrices on a CUDA GPU, and there are no toolboxes for that either (Jacket is discontinued). That's why I am using CUSP, integrated into Matlab through a MEX file. However, the tool I developed has two problems:
I'm solving A*x=b, where A is a sparse, symmetric matrix and b is a vector.
Hardware specs: Intel i7 3630QM, GT640M 2G, 8 GB DDR3.
Software: Windows 8 64 bit, Matlab R2012b 64 bit, CUDA 5.0 64 bit, CUSP 0.3.1, Windows SDK v7.0, VS2010 compiler.
MEX code:
#include <cusp/csr_matrix.h>
#include <cusp/krylov/bicgstab.h>
#include <matrix.h>
#include <mex.h>
#include <time.h>
void mexFunction(int nlhs,mxArray *plhs[],int nrhs,const mxArray *prhs[])
{
double t1 = clock();
// data from Matlab
double *b = mxGetPr(prhs[1]);
double *A = mxGetPr(prhs[0]);
int n = mxGetM(prhs[0]);
mwIndex *ir = mxGetIr(prhs[0]);
mwIndex *jc = mxGetJc(prhs[0]);
int N = jc[n];
t1 = clock() - t1;
double t2 = clock();
// initialization of matrix A in CSR format (jc and ir are exchanged, because Matlab uses CSC format)
cusp::csr_matrix<int,float,cusp::device_memory> Ag(n,n,3*n-2);
thrust::copy(jc, jc + n + 1, Ag.row_offsets.begin());
thrust::copy(ir, ir + N, Ag.column_indices.begin());
thrust::copy(A, A + N, Ag.values.begin());
// initialization of vector b
cusp::array1d<float, cusp::device_memory> bg (b, b+n);
cusp::array1d<float, cusp::device_memory> xg (n, 0);
t2 = clock() - t2;
double t3 = clock();
// bicgstab solution for vector x, using a 0.001 relative tolerance and the (identity) preconditioner M
// this is the slowest part, much slower than others
cusp::verbose_monitor<float> monitor(bg, 5000, 1e-3);
cusp::identity_operator<float, cusp::device_memory> M(n, n);
cusp::krylov::bicgstab(Ag, xg, bg, monitor, M);
t3 = clock() - t3;
double t4 = clock();
// gathering solution vector back on host into Matlab array T
mxArray *T = mxCreateDoubleMatrix(n, 1, mxREAL);
double *x = mxGetPr(T);
thrust::copy(xg.begin(), xg.end(), x);
t4 = clock() - t4;
// gathering execution times to Matlab array times
mxArray *times=mxCreateDoubleMatrix(5, 1, mxREAL);
double *timesb=mxGetPr(times);
timesb[0]=t1; timesb[1]=t2; timesb[2]=t3; timesb[3]=t4; timesb[4]=monitor.iteration_count();
// sending data back to Matlab
plhs[0] = times;
plhs[1] = T;
}
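The timers in the MEX code above store differences of clock() readings, so the first four entries of times are in clock ticks, not seconds. A minimal sketch of the conversion follows; the helper name ticks_to_seconds is hypothetical and not part of the original code. It only assumes the standard CLOCKS_PER_SEC constant from <ctime>.

```cpp
#include <ctime>

// Hypothetical helper (not in the original MEX file):
// convert the difference of two clock() readings into seconds.
// clock() counts implementation-defined ticks, and CLOCKS_PER_SEC
// is the number of ticks per second on the current platform.
double ticks_to_seconds(std::clock_t start, std::clock_t stop)
{
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}
```

In the MEX function this would mean keeping the start and stop readings as std::clock_t values and storing ticks_to_seconds(start, stop) in timesb instead of the raw difference, so that the numbers can be compared directly with tic/toc in Matlab.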
Compile the MEX code (saved as ex.cu) into a MEX file from Matlab using these commands (change the second command for 32 bit if necessary):
>> !nvcc -c -arch sm_20 ex.cu -Xcompiler -fPIC -I "C:\Program Files\MATLAB\R2012b\extern\include" -I "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\include"
>> mex ex.obj -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\lib\x64" -lcudart
Sample matrices, vectors and compiled 64 bit MEX function: http://www.failai.lt/3fqkhvoslxyt/sampleData.7z.htm
Use:
tic; [times,x]=ex(K',F); toc; %K has to be transposed for CSR
where times holds the separate execution times, with the iteration count used for the solution (from the bicgstab monitor) as its last element, and x is the solution of K*x=F.
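The reason K has to be transposed is that Matlab stores sparse matrices in CSC form (jc, ir, values), and the CSC arrays of K' are, entry for entry, the CSR arrays of K. The sketch below illustrates this on the host with plain C++ and no CUSP or MEX dependency. The Compressed struct and the csc_to_csr function are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for compressed sparse storage (CSC or CSR):
// offsets has (major dimension + 1) entries, indices and values have
// one entry per non-zero.
struct Compressed {
    std::vector<std::size_t> offsets;
    std::vector<std::size_t> indices;
    std::vector<double> values;
};

// Convert an n_rows x n_cols matrix from CSC to CSR with a counting sort
// over the row indices.
Compressed csc_to_csr(const Compressed& csc, std::size_t n_rows, std::size_t n_cols)
{
    const std::size_t nnz = csc.offsets[n_cols];
    Compressed csr;
    csr.offsets.assign(n_rows + 1, 0);
    csr.indices.resize(nnz);
    csr.values.resize(nnz);

    // Count the non-zeros in each row, then prefix-sum into row offsets.
    for (std::size_t k = 0; k < nnz; ++k) ++csr.offsets[csc.indices[k] + 1];
    for (std::size_t r = 0; r < n_rows; ++r) csr.offsets[r + 1] += csr.offsets[r];

    // Scatter the entries column by column, so that column indices end up
    // sorted within each row.
    std::vector<std::size_t> next(csr.offsets.begin(), csr.offsets.end() - 1);
    for (std::size_t c = 0; c < n_cols; ++c) {
        for (std::size_t k = csc.offsets[c]; k < csc.offsets[c + 1]; ++k) {
            const std::size_t dst = next[csc.indices[k]]++;
            csr.indices[dst] = c;
            csr.values[dst] = csc.values[k];
        }
    }
    return csr;
}
```

For the 2x2 matrix A = [1 2; 0 3], the CSC arrays are offsets {0,1,3}, indices {0,0,1}, values {1,2,3}. Passing them to csc_to_csr gives offsets {0,2,3}, indices {0,1,1}, values {1,2,3}, which are exactly the CSC arrays of A'. When the matrix is exactly symmetric, K and K' have identical arrays, so the transpose changes nothing in the data passed to the MEX function.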
Results ( http://www.failai.lt/rupaliln7kfb/results.7z.htm ):
Could you look at the code, maybe try the MEX function, report your results, and suggest how to improve the function? Maybe you know of alternatives that enable sparse computations on the GPU? I hope it will be useful for everyone until Matlab adds support for sparse matrices on the GPU :)
Take a look at the Matlab File Exchange: there is a CUSP sparse class for GPUs, with support for single precision, real/complex: http://www.mathworks.com/matlabcentral/fileexchange/44423-gpu-sparse-accumarray-non-uniform-grid
Sparse matrix-vector multiply is overloaded with CUSP.