Copy to shared memory in CUDA

Question

In CUDA programming, if we want to use shared memory, we need to bring the data from global memory to shared memory. Threads are used for transferring such data.

I read somewhere (in online resources) that it is better not to involve all the threads in the block when copying data from global memory to shared memory. That idea makes sense, since not all the threads execute together; only the threads within a warp execute together. My concern is that the warps are not necessarily executed in order. Say a block of 96 threads is divided into 3 warps: warp 0 (threads 0-31), warp 1 (threads 32-63), warp 2 (threads 64-95). It is not guaranteed that warp 0 will be executed first (am I right?).

So which threads should I use to copy the data from global to shared memory?

Answer 1

To use a single warp to load a shared memory array, just do something like this:

__global__
void kernel(float *in_data)
{
    __shared__ float buffer[1024];

    if (threadIdx.x < warpSize) {
        for (int i = threadIdx.x; i < 1024; i += warpSize) {
            buffer[i] = in_data[i];
        }
    }
    __syncthreads();

    // rest of kernel follows
}

[disclaimer: written in browser, never tested, use at own risk]

The key point here is the use of __syncthreads() to ensure that all threads in the block wait until the warp performing the load to shared memory has finished. The code I posted uses the first warp, but you can calculate a warp number by dividing the thread index within the block by warpSize. I also assumed a one-dimensional block. It is trivial to compute the thread index in a 2D or 3D block, so I leave that as an exercise to the reader.
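For readers who want to see that exercise worked, here is a minimal sketch of the same idea for a block of any dimensionality, with the loading warp chosen by number. The names load_with_one_warp, check_one_warp_load, kBufferSize and loader_warp are made up for illustration and are not from the answer above. The out_data copy exists only so the host can check the result. Like the original, it has not been tuned and assumes a single block.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBufferSize = 1024;  // same size as the buffer in the answer above

// Hypothetical kernel: the warp numbered `loader_warp` fills the shared
// buffer, then all threads copy it back out so the host can verify it.
__global__ void load_with_one_warp(const float *in_data, float *out_data,
                                   int loader_warp)
{
    __shared__ float buffer[kBufferSize];

    // Linear thread index within a 1D, 2D, or 3D block (x varies fastest).
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    int threads_per_block = blockDim.x * blockDim.y * blockDim.z;

    int warp_id = tid / warpSize;  // which warp this thread belongs to
    int lane_id = tid % warpSize;  // position of the thread inside its warp

    if (warp_id == loader_warp) {
        for (int i = lane_id; i < kBufferSize; i += warpSize) {
            buffer[i] = in_data[i];
        }
    }
    __syncthreads();  // every warp waits here until the load is complete

    for (int i = tid; i < kBufferSize; i += threads_per_block) {
        out_data[i] = buffer[i];
    }
}

// Hypothetical host helper: returns true if the data survived the round trip.
bool check_one_warp_load()
{
    std::vector<float> host_in(kBufferSize), host_out(kBufferSize, -1.0f);
    for (int i = 0; i < kBufferSize; ++i) host_in[i] = static_cast<float>(i);

    float *dev_in = nullptr, *dev_out = nullptr;
    size_t bytes = kBufferSize * sizeof(float);
    if (cudaMalloc(&dev_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dev_out, bytes) != cudaSuccess) { cudaFree(dev_in); return false; }
    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 6, 1);  // 96 threads = 3 warps, as in the question
    load_with_one_warp<<<1, block>>>(dev_in, dev_out, 1);  // warp 1 does the load
    cudaError_t err = cudaMemcpy(host_out.data(), dev_out, bytes,
                                 cudaMemcpyDeviceToHost);

    cudaFree(dev_in);
    cudaFree(dev_out);
    return err == cudaSuccess && host_out == host_in;
}
```

Calling check_one_warp_load() from main with loader_warp set to 0, 1 or 2 should give the same result, which shows that it does not matter which warp performs the load as long as the barrier follows it.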

Answer 2

Once a block is assigned to a multiprocessor, it resides there until all threads in that block have finished, and during this time the warp scheduler switches among the warps that have ready operands. So if there is one block on the multiprocessor with three warps, and just one warp is fetching data from global to shared memory while the other two sit idle, probably waiting at the __syncthreads() barrier, you lose nothing: you are limited only by the latency of global memory, as you would have been anyway. As soon as the fetch has finished, the warps can continue with their work.

Therefore, no guarantee that warp 0 executes first is needed, and you can use any threads. The only two things to keep in mind are to make global memory accesses as coalesced as possible and to avoid shared memory bank conflicts.
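As an illustration of "use any threads", here is a minimal sketch in which every thread of a one-dimensional block takes part in the load. Consecutive threads read consecutive floats from global memory, which is the coalesced pattern, and write consecutive 4-byte words of shared memory, which avoids bank conflicts for this simple layout. The names load_with_all_threads, check_all_threads_load and kTileSize are hypothetical, and the single-block launch and reversed read-back are assumptions made only so the result can be checked.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kTileSize = 1024;  // assumed tile size for this sketch

// Hypothetical kernel: all threads cooperate on the load, then each thread
// reads the tile reversed, so it consumes values loaded by other threads.
__global__ void load_with_all_threads(const float *in_data, float *out_data)
{
    __shared__ float tile[kTileSize];

    // Block-stride loop: thread t handles elements t, t + blockDim.x, ...
    // Neighbouring threads touch neighbouring addresses on every iteration.
    for (int i = threadIdx.x; i < kTileSize; i += blockDim.x) {
        tile[i] = in_data[i];
    }
    __syncthreads();  // no warp reads the tile before every warp has written

    for (int i = threadIdx.x; i < kTileSize; i += blockDim.x) {
        out_data[i] = tile[kTileSize - 1 - i];
    }
}

// Hypothetical host helper: returns true if out_data is in_data reversed.
bool check_all_threads_load()
{
    std::vector<float> host_in(kTileSize), host_out(kTileSize, -1.0f);
    for (int i = 0; i < kTileSize; ++i) host_in[i] = static_cast<float>(i);

    float *dev_in = nullptr, *dev_out = nullptr;
    size_t bytes = kTileSize * sizeof(float);
    if (cudaMalloc(&dev_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dev_out, bytes) != cudaSuccess) { cudaFree(dev_in); return false; }
    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    load_with_all_threads<<<1, 96>>>(dev_in, dev_out);  // 3 warps, as in the question
    cudaError_t err = cudaMemcpy(host_out.data(), dev_out, bytes,
                                 cudaMemcpyDeviceToHost);

    cudaFree(dev_in);
    cudaFree(dev_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < kTileSize; ++i) {
        if (host_out[i] != host_in[kTileSize - 1 - i]) return false;
    }
    return true;
}
```

The barrier is what makes the order of warp execution irrelevant here: whichever warp runs first, no thread reads the tile until all of them have finished writing it.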
