
How to copy a flattened 2D array from Global Memory to Shared Memory in CUDA

I have a kernel that receives a flattened 2D array, and I would like to copy one line of the array at a time into shared memory. My kernel looks like this:

__global__ void searchKMP(char *test, size_t pitch_test, int ittNbr){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int strideId = tid * 50;

    int m = 50;

    __shared__ char s_test[m];

    int j;

    // loop over the lines of the 2D array
    for(int k=0; k<ittNbr; k++){

        // copy the current line of the flattened array into shared memory (one line at a time)
        if(threadIdx.x==0){
            for(int n=0; n<50; ++n){
                s_test[n] = *(((char*)test + k * pitch_test) + n);
            }
        }
        __syncthreads();

        j=0;

        // process the shared memory array against another 1D array
        for(int i=strideId; i<(strideId+50); i++){
            ...dosomething...
            (increment x if a condition is met)
            ...dosomething...
        }
        __syncthreads();
        if(x!=0)
            cache[0]+=x;

        ...dosomething...
    }
}

However, when I check the value of x, it changes from run to run and also depends on the launch configuration. For example, 10 blocks of 500 threads return 9, while 20 blocks of 250 threads return 7 or 6 depending on the execution. I wonder whether the problem comes from copying the flattened 2D array into shared memory or whether something else in this code is wrong.

It looks like your array in shared memory has 20 elements:

   int m = 20;
   __shared__ char s_test[m];

But in your inner loop you are trying to write 50 elements:

   for(int n =0; n<50; ++n){
      s_test[n] = *(((char*)test + k * pitch_test) + n);

I don't know if this is specifically the problem you were looking for, but writing 50 elements into a 20-element array runs past the end of the shared buffer, so it won't work as written.
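One way to keep the buffer size and the loop bound from drifting apart is to drive both from a single value. The sketch below is only an illustration, not the asker's kernel: `reverseRow` and `reverseOnGpu` are hypothetical names, the row is copied into dynamically sized shared memory (declared `extern __shared__` and sized by the third launch parameter), and the same `rowLen` bounds every loop. It assumes a non-empty input that fits in shared memory, and error checking is omitted for brevity.

```cuda
#include <string>
#include <cuda_runtime.h>

// Hypothetical kernel: rowLen is the single source of truth for both the
// shared buffer size (set at launch) and every loop bound.
__global__ void reverseRow(const char *in, char *out, int rowLen)
{
    extern __shared__ char s_row[];  // size comes from the 3rd launch parameter

    // Strided copy so the kernel works for any block size.
    for (int n = threadIdx.x; n < rowLen; n += blockDim.x)
        s_row[n] = in[n];
    __syncthreads();  // the whole row must be in shared memory before reading

    // Each thread reads an element that another thread may have written.
    for (int n = threadIdx.x; n < rowLen; n += blockDim.x)
        out[n] = s_row[rowLen - 1 - n];
}

// Hypothetical host helper: reverses a non-empty string on the GPU.
std::string reverseOnGpu(const std::string &text)
{
    const int rowLen = static_cast<int>(text.size());
    char *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, rowLen);
    cudaMalloc(&dOut, rowLen);
    cudaMemcpy(dIn, text.data(), rowLen, cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    reverseRow<<<1, 64, rowLen * sizeof(char)>>>(dIn, dOut, rowLen);

    std::string result(rowLen, '\0');
    cudaMemcpy(&result[0], dOut, rowLen, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Calling `reverseOnGpu("abcde")` should return the reversed string. If the row length is known at compile time, a `constexpr int` used for both the `__shared__` array size and the loop bound achieves the same thing without dynamic shared memory.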

Shared memory is shared across all threads in the same block.

It is not very clear why you need shared memory or what you are trying to do.

In your code, all threads in the block write the same values to shared memory many times, which is redundant.

A common way to work with shared memory looks like this:

if(threadIdx.x < m)
  s_test[threadIdx.x] = *(global_mem_pointer + threadIdx.x);

__syncthreads();

Each thread in the block writes its own element "at the same moment", and after __syncthreads() the shared memory is filled with the data you need and is visible to all threads in the block.
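The pattern above can be sketched for a pitched 2D array, one row at a time. This is a minimal illustration under stated assumptions, not the asker's KMP search: `ROW_LEN`, `countPairs` and `countPairsOnGpu` are hypothetical names, rows are assumed to be at most 50 characters (padded with zeros), and the work done on each row is simply counting where character `a` is immediately followed by `b`. The per-thread results are combined with `atomicAdd`, because an unsynchronized `+=` on one shared counter from many threads is a data race. Error checking is omitted for brevity.

```cuda
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>
#include <cuda_runtime.h>

constexpr int ROW_LEN = 50;  // assumed row width, matching the 50 in the question

// Hypothetical kernel: for each row, load it cooperatively into shared memory,
// then count positions where `a` is immediately followed by `b`.
__global__ void countPairs(const char *rows, size_t pitch, int numRows,
                           char a, char b, int *count)
{
    __shared__ char s_row[ROW_LEN];

    for (int k = 0; k < numRows; ++k) {
        const char *row = rows + k * pitch;  // start of row k in pitched memory

        // Cooperative load: each thread copies its own element(s).
        for (int n = threadIdx.x; n < ROW_LEN; n += blockDim.x)
            s_row[n] = row[n];
        __syncthreads();  // row k is now visible to every thread in the block

        // Thread i reads s_row[i + 1], which a different thread wrote.
        for (int i = threadIdx.x; i + 1 < ROW_LEN; i += blockDim.x)
            if (s_row[i] == a && s_row[i + 1] == b)
                atomicAdd(count, 1);  // race-free accumulation
        __syncthreads();  // nobody may still read s_row when row k+1 overwrites it
    }
}

// Hypothetical host helper: expects at least one line.
int countPairsOnGpu(const std::vector<std::string> &lines, char a, char b)
{
    const int numRows = static_cast<int>(lines.size());

    // Pack the lines into a dense, zero-padded host buffer of ROW_LEN columns.
    std::vector<char> host(static_cast<size_t>(numRows) * ROW_LEN, 0);
    for (int r = 0; r < numRows; ++r)
        std::memcpy(&host[static_cast<size_t>(r) * ROW_LEN], lines[r].data(),
                    std::min<size_t>(lines[r].size(), ROW_LEN));

    char *dRows = nullptr;
    size_t pitch = 0;
    int *dCount = nullptr;
    cudaMallocPitch(&dRows, &pitch, ROW_LEN, numRows);
    cudaMemcpy2D(dRows, pitch, host.data(), ROW_LEN, ROW_LEN, numRows,
                 cudaMemcpyHostToDevice);
    cudaMalloc(&dCount, sizeof(int));
    cudaMemset(dCount, 0, sizeof(int));

    countPairs<<<1, 64>>>(dRows, pitch, numRows, a, b, dCount);

    int count = 0;
    cudaMemcpy(&count, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dRows);
    cudaFree(dCount);
    return count;
}
```

A single block is launched here so that the count does not depend on the grid size. With several blocks, each block would need to work on its own subset of rows, otherwise every block would count the same rows again.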

