How to implement an interface to a sub-matrix in CUDA?
I have a wrapper class CudaMatrix that implements several cuBLAS operations, allowing me to call m1.multiply(m2), which runs the sgemm operation on the internal data pointers.
I would like to extend the class with operations on sub-matrices, something like:
CudaMatrix a(100,100);
CudaMatrix b(100,100);
// fill a and b
int i=5, j=15;
CudaSubMatrix sa(a, i, j, i+10, j+10); // sa := a[5:15, 15:25]
i=50, j=60;
CudaSubMatrix sb(b, i, j, i+10, j+10); // sb := b[50:60, 60:70]
CudaMatrix res;
res.copy(sa);
res.multiply(sb); // res = sa*sb
In the last line, multiply() needs to operate on the sub-matrix sb, so its rows are not contiguous and I can't call the same sgemm operation as before.
How do I implement an efficient interface to sub-matrices that avoids copying data explicitly? Are there any open-source implementations I can look at?
The sub-matrix multiply may be performed using the ldx (leading dimension) parameters of the API calls.
Indexing is described in the 1.1 Data Layout section of the cuBLAS documentation:
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
Then call cublasSgemm, for example, with the lda parameter equal to the number of rows of the original matrix (the cuBLAS library uses column-major storage), and m, n, k set to the dimensions of the sub-matrices.
Note that indexing differs between the Fortran-style (column-major) and C (row-major) schemes.
Hence what you really need is the size of your sub-matrix (columns, rows) and the size of a column in the input matrix (its number of rows).