
Shared memory in CUDA

I have an NxN matrix, u, of complex data, and I'd like to multiply each row elementwise by a 1xN vector, k. The data in u is stored row-wise.

I have two implementations of this. One which takes advantage of shared memory, dividing the matrix into tiles, and another which does not.

I am finding that the shared memory implementation, multiply1, is not faster; it is consistently just as fast as, or even slower than, multiply2.

The shared memory implementation is as follows,

__global__ void multiply1(cufftComplex *u, cufftComplex *k) {
    __shared__ cufftComplex k_s[BLOCK_WIDTH];
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int idy = blockDim.y * blockIdx.y + threadIdx.y;
    int index;

    // One row of threads per block stages this block's slice of k.
    if (threadIdx.y == 0 && idx < N) {
        k_s[threadIdx.x] = k[idx];
    }
    __syncthreads();

    if (idx < N && idy < N) {
        index = N * idy + idx;
        u[index] = cuCmulf(k_s[threadIdx.x], u[index]);
    }
}

Whereas the global memory implementation is as follows,

__global__ void multiply2(cufftComplex *u, cufftComplex *k) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx < N * N) {
        u[idx] = cuCmulf(k[idx % N], u[idx]);
    }
}

and the kernel launches in main, for a 64x64 matrix:

dim3 block(16,16);
dim3 grid(4,4);
multiply1<<<grid, block>>>(d_u, d_k);
multiply2<<<16, 256>>>(d_u, d_k);

How can I use the profiler to find out why exactly multiply1 is not getting even a slight increase in speed? Which metrics would elucidate what exactly is going on?

The profiler is telling me that for multiply1, I am getting 152 GB/s global memory load throughput, whereas for multiply2 I am getting 81 GB/s. This is logical, because I'm loading less from global memory. Shouldn't this translate into quicker execution?

It would have been faster if you read the shared data many times, but here you read it only once. You transformed your problem into:

copy from global memory to shared memory
read from shared memory

instead of:

read from global memory

So yes, it is certainly slower than the previous algorithm, which uses only global memory. If you want to take advantage of shared memory, your algorithm has to read from it many times; otherwise you won't amortize the cost of the extra copy from global memory.
