
Matrix operations in CUDA

What is the best way to organize matrix operations in CUDA (in terms of performance)? For example, I want to calculate C * C^(-1) * B^T + C, where C and B are matrices.

Should I write separate functions for multiplication, transposition and so on, or write one function for the whole expression?

Which way is the fastest?

I'd recommend you use the CUBLAS library. It's normally much faster and more reliable than anything you could write on your own. In addition, its API is similar to that of BLAS, the standard library for numerical linear algebra.
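As a sketch of what the cuBLAS route looks like: a single `cublasSgemm` call computes `D = alpha * op(A) * op(B) + beta * D`, so a multiply-by-transpose plus an addition collapses into one library call. This assumes single-precision n-by-n matrices already resident on the device in column-major order; the function and matrix names are illustrative.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: D = A * B^T + A in a single GEMM, for n x n device matrices.
void mul_transpose_add(cublasHandle_t handle, int n,
                       const float *dA, const float *dB, float *dD) {
    const float one = 1.0f;
    // Seed D with A so the GEMM's "+ beta * D" term adds it back in.
    cudaMemcpy(dD, dA, sizeof(float) * n * n, cudaMemcpyDeviceToDevice);
    // D = 1.0 * A * B^T + 1.0 * D  -- no hand-written kernel needed.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                n, n, n, &one, dA, n, dB, n, &one, dD, n);
}
```

For the C^(-1) factor, cuBLAS also provides batched LU routines (`cublasSgetrfBatched` / `cublasSgetriBatched`) that compute matrix inverses on the device, so the whole expression can be evaluated without writing any kernel yourself.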

I think the answer depends heavily on the size of your matrices.

If a matrix fits in shared memory, I would probably use a single block to compute it, and keep everything inside a single kernel (probably a bigger one, where this computation is only a part). Hopefully, if you have many such matrices and need to evaluate the expression several times, you can do them in parallel and utilise all of the GPU's computing power.
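A minimal sketch of the single-block idea: both operands are staged into shared memory once, every thread computes one output element, and the whole product never touches global memory more than necessary. The tile size and names are hypothetical; a real kernel for the full expression would chain several such stages inside one launch.

```cpp
#define TILE 16  // assumed small enough that both matrices fit in shared memory

// One block computes P = M * N for a single TILE x TILE matrix pair.
__global__ void small_matmul(const float *M, const float *N, float *P) {
    __shared__ float sM[TILE][TILE];
    __shared__ float sN[TILE][TILE];
    int r = threadIdx.y, c = threadIdx.x;

    // Load both operands into shared memory once.
    sM[r][c] = M[r * TILE + c];
    sN[r][c] = N[r * TILE + c];
    __syncthreads();  // all loads must finish before anyone reads

    float acc = 0.0f;
    for (int k = 0; k < TILE; ++k)
        acc += sM[r][k] * sN[k][c];
    P[r * TILE + c] = acc;
}
// Launch for one matrix: small_matmul<<<1, dim3(TILE, TILE)>>>(dM, dN, dP);
// With many independent matrices, give each its own block and offset the
// pointers by blockIdx.x to keep the whole GPU busy.
```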

However, if your matrices are much bigger, you will want multiple blocks to compute them (see the matrix multiplication example in the CUDA manual). In that case you need a guarantee that the multiplication is finished by all blocks before you proceed with the next part of your expression, so you will need a separate kernel call for each of your operations.
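The inter-block guarantee comes for free from stream ordering: kernels launched into the same stream execute in order, so each launch only starts after every block of the previous one has finished. A sketch of the multi-kernel version, with hypothetical kernel names and pre-allocated device buffers:

```cpp
// Kernels in the same (default) stream run in launch order, which is
// exactly the cross-block barrier needed between the operations.
matmul_kernel<<<grid, block>>>(dC, dCinv, dTmp, n);        // Tmp = C * C^(-1)
transpose_kernel<<<grid, block>>>(dB, dBt, n);             // Bt  = B^T
matmul_add_kernel<<<grid, block>>>(dTmp, dBt, dC, dOut, n); // Out = Tmp * Bt + C
```

The extra kernel-launch overhead is usually negligible next to the arithmetic for large matrices, which is why the per-operation split is the standard design (and the one cuBLAS itself uses).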
