
PyCUDA: C/C++ includes?

Something that isn't really mentioned anywhere (at least as far as I can see) is which library functions are exposed to inline CUDA kernels.

Specifically, I'm doing small/stupid matrix multiplications that don't deserve to be individually offloaded to the GPU, but I am offloading a larger section of the algorithm which includes this multiplication. No one has ever liked using their own linalg functions, since someone has always done it better.

TL;DR: What libraries can I play with inside inline kernels under PyCUDA?

I don't know of any, and I always thought it would be useful to have.

For the size of problems that I usually work with (small matrices and tensors that arise in the finite element method), I just wrote C++ templates to do the operations. Templating the functions allows the compiler to know the trip counts at compile time, so it can unroll loops and keep results or intermediate results in registers, which tends to be very efficient for kernel throughput. So the matrix-matrix product gets declared as:

// idx2c was not defined in the original answer; a column-major
// indexing macro in the style of cuBLAS's IDX2C is assumed here.
#define idx2c(i, j, ld) (((j) * (ld)) + (i))

template <typename Real, unsigned int l, unsigned int m, unsigned int n>
__device__ __host__
void matmul(const Real *a,
            const Real *b,
                  Real *c)
{
    // c (l x n) = a (l x m) * b (m x n), all stored column-major
    for (int i = 0; i < l; i++) {
        for (int j = 0; j < n; j++) {
            Real dotprod = Real(0);
            for (int k = 0; k < m; k++) {
                dotprod += a[idx2c(i, k, l)] * b[idx2c(k, j, m)];
            }
            c[idx2c(i, j, l)] = dotprod;
        }
    }
}

For the sort of sizes that crop up in my kernels (2x2, 3x3, 4x4, 8x8, 9x9), doing the above and letting the compiler work things out seems to be as good as any other approach I have tried. Because at the thread level CUDA is effectively scalar, there aren't any vector primitives or similar features that can be used to accelerate these sorts of small operations.
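As a minimal usage sketch, assuming one thread per matrix pair and a flat batched layout (the kernel name, sizes, and layout here are illustrative assumptions, not part of the original answer), the template might be instantiated inside a kernel like this:

__global__ void batched_matmul_3x3(const float *a, const float *b,
                                   float *c, int nmats)
{
    // One thread computes one 3x3 * 3x3 product; the trip counts
    // (3, 3, 3) are compile-time constants, so the loops inside
    // matmul can be fully unrolled by the compiler.
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < nmats) {
        matmul<float, 3, 3, 3>(a + 9 * t, b + 9 * t, c + 9 * t);
    }
}

Because the template is declared __device__ __host__, the same function can also be compiled for and tested on the host.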
