
Kernel launch failure if the amount of shared memory allocated for the whole grid exceeds 48kB

I am working on an N-body problem requiring a large amount of shared memory.

Basically, there are N independent tasks, each one using 4 double variables, i.e. 32 bytes. A single task is executed by a thread.

For the sake of speed, I have been using shared memory for these variables (given that registers are also being used by the threads). Since the number N of tasks is not known at compile time, the shared memory is allocated dynamically.

  • The dimensions of the grid and the shared memory are computed depending on N and the block size:

     const size_t BLOCK_SIZE = 512;
     const size_t GRID_SIZE = (N % BLOCK_SIZE) ? (int) N/BLOCK_SIZE + 1 : (int) N/BLOCK_SIZE;
     const size_t SHARED_MEM_SIZE = BLOCK_SIZE * 4 * sizeof(double);
  • Then the kernel is launched using these 3 variables.

     kernel_function<<<GRID_SIZE, BLOCK_SIZE, SHARED_MEM_SIZE>>>(N, ...); 

For small N, this works fine and the kernel executes without error.

But if N exceeds 1500, the kernel launch fails (with the following messages appearing multiple times):

========= Invalid __global__ write of size 8
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaLaunch. 

As far as I understand, this is due to an attempt to write out of the bounds of the allocated shared memory. This occurs when, in the kernel, global memory is copied into shared memory:

__global__ void kernel_function(const size_t N, double *pN, ...)
{
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;

    if(idx<N)
    {
        extern __shared__ double pN_shared[];
        for(int i=0; i < 4; i++)
        {
            pN_shared[4*idx + i] = pN[4*idx + i];
        }
        ...
    }
}

This error happens only if N > 1500, hence when the overall amount of shared memory exceeds 48kB (1500 * 4 * sizeof(double) = 1500 * 32 = 48000).
This limit is the same regardless of the grid and block sizes.

If I have understood correctly how CUDA works, the cumulated amount of shared memory that the grid uses is not limited to 48kB; that is only the limit on the shared memory that a single thread block can use.

This error makes no sense to me, since the cumulated amount of shared memory should only affect the way the grid is scheduled among the streaming multiprocessors (and moreover, the GPU device has 15 SMs at its disposal).

The amount of shared memory you are allocating dynamically here:

kernel_function<<<GRID_SIZE, BLOCK_SIZE, SHARED_MEM_SIZE>>>(N, ...);
                                         ^^^^^^^^^^^^^^^

is the amount per threadblock, and that amount is limited to 48KB (which is 49152 bytes, not 48000). So if you attempt to allocate more than 48KB there, you should get an error if you are checking for it.

However, we can draw two conclusions from this:

========= Invalid __global__ write of size 8
  1. A kernel did actually launch.
  2. The reported failure has to do with out-of-bounds indexing into global memory, on a write to global memory, not shared memory. (So it cannot be occurring on a read from global memory to populate shared memory, as your conjecture suggests.)

So in general I think your conclusions are incorrect, and you probably need to do more debugging rather than jumping to conclusions about shared memory.

If you want to track down the source of the invalid global write to a specific line of code in your kernel, this question/answer may be of interest.

You are accessing the shared array at index idx*4 + 0:3. The program is incorrect starting at N > BLOCK_SIZE. Luckily it seems to work up to 1500, but running cuda-memcheck should point out the issue. On a related topic, note that statically allocated shared memory declared elsewhere might also be using shared memory. Printing out the value of the pointer will help figure this out.

I think the issue here is that all threads inside a block must run on the same SM. Therefore each block still has the hard 48kB limit on shared memory, no matter how many threads run in that block. Scheduling does not matter, since the GPU cannot split the threads of one block across multiple SMs. I would try to reduce BLOCK_SIZE if you can, since that directly determines the amount of shared memory per block. However, if you reduce it too far you can run into issues where you are not fully utilizing the compute resources of an SM. It is a balancing act, and from my experience the CUDA architecture presents a lot of interesting trade-offs like this.

Also, in your case I am not even sure you need shared memory. I would just use a local variable. I think local variables are stored in global memory, but access to them is aligned, so it is very fast. If you want to do something neat with shared memory to improve performance, here is the OpenCL kernel of my N-body simulator. Using shared memory to create a cache for every thread in a block gives me about a 10x speedup.

In this model, each thread is responsible for calculating the acceleration of one body resulting from the gravitational attraction of every other body. This requires each thread to loop through all N bodies. The shared-memory cache enhances this, since each thread in a block can load a different body into shared memory and they can all share them.

__kernel void acceleration_kernel
(
    __global const double* masses, 
    __global const double3* positions,
    __global double3* accelerations,
    const double G,
    const int N,
    __local double4* cache //shared memory cache (local means shared memory in OpenCL)
)
{
    int idx = get_global_id(0);
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    if(idx >= N)
        return;

    double3 pos = positions[idx];
    double3 a = { };

    //number of loads required to compute the acceleration on Body(idx) from all other bodies
    int loads = (N + (lsz - 1)) / lsz;

    for(int load = 0; load < loads; load++)
    {
        barrier(CLK_LOCAL_MEM_FENCE);

        //compute which body this thread is responsible for loading into the cache
        int load_index = load * lsz + lid;
        if(load_index < N)
            cache[lid] = (double4)(positions[load_index], masses[load_index]);

        barrier(CLK_LOCAL_MEM_FENCE);

        //now compute the acceleration from every body added to the cache
        for(int i = load * lsz, j = 0; i < N && j < lsz; i++, j++)
        {
            if(i == idx)
                continue;
            double3 r_hat = cache[j].xyz - pos; 
            double over_r = rsqrt(0.0001 + r_hat.x * r_hat.x + r_hat.y * r_hat.y + r_hat.z * r_hat.z);
            a += r_hat * G * cache[j].w * over_r * over_r * over_r;
        }
    }

    accelerations[idx] = a;
}
