
How come my kernel doesn't exceed the shared memory limit?

I am calling CUDA kernels from MATLAB.

I was previously told (in David Kirk's book) that one could only use 16 KB of shared memory per thread, but I am able to consume far more than that:

__global__ void plain(float* arg)
{
    __shared__ float array[12000];  // 12000 floats * 4 bytes = 48000 bytes of shared memory
    int k;

    for (k = 0; k < 12000; k++)     // start at 0 so element 0 is initialized too
    {
        array[k] = 1;
    }
}

CUDA C reports that a float is 4 bytes, meaning the total array size is 48 KB, which is greater than 16 KB. It runs fine, so how can this be?

I am also told, in GPU shared memory size is very small - what can I do about it?, that the max shared memory per block is what matters. The max shared memory per block for my card is 49152 bytes, yet I am able to run the above code with 1000 threads per block.

It seems like it would use 48 KB per block, which can't be right. Is it that the SM only services one block at a time, and in doing so preserves the condition that there can only be 48 KB per thread block?

How is 48 KB (49152 bytes) of shared memory per block reconciled with 16 KB of shared memory per thread?

Thanks

Shared memory is allocated per thread block, with up to 48 KB available per SM on compute capability 2.0 and higher. So on a given SM you could be running a single thread block that consumes the entire 48 KB or, say, three thread blocks each of which allocates 16 KB. The limit of 16 KB of shared memory per SM applies to compute capabilities < 2.0. As opposed to shared memory, which is allocated per thread block, local memory ("local" meaning "thread-local") is allocated per thread.

Threads don't have shared memory. Your code uses "block" shared memory (there is no other shared memory in CUDA).

