
Shared memory in CUDA: how does it work?

I have a program that computes the values of an array. Array A has 32 elements, with values 0 to 31. Array B has 16 elements, all initialized to 0.

**I want to compute B following this rule: B[i] = A[i*2] + A[i*2+1], for i from 0 to 15.** I use CUDA programming, with this example code:

Main.cu

__global__ void Kernel(int *devB, int *devA)
{
    // Use shared memory: 16 threads per block, so 16 elements of shared memory per block.
    __shared__ int smA[16];
    // Copy data from global memory to shared memory; each thread copies one element.
    smA[threadIdx.x] = devA[threadIdx.x + blockIdx.x * blockDim.x];
    __syncthreads();
    // Only the first 8 threads of each block write a result.
    // Each block produces 8 elements of B, so the output offset is blockIdx.x * 8.
    if (threadIdx.x < 8)
    {
        devB[threadIdx.x + blockIdx.x * 8] =
            smA[threadIdx.x * 2] + smA[threadIdx.x * 2 + 1];
    }
}

And the main function:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main()
{
    int *A = (int*)malloc(sizeof(int) * 32);
    int *B = (int*)malloc(sizeof(int) * 16);

    for (int i = 0; i < 32; i++)
        A[i] = i;

    int *devA = NULL;
    cudaMalloc((void**)&devA, sizeof(int) * 32);
    cudaMemcpy(devA, A, sizeof(int) * 32, cudaMemcpyHostToDevice);
    int *devB = NULL;
    cudaMalloc((void**)&devB, sizeof(int) * 16);

    dim3 block(16, 1, 1);
    dim3 grid(2, 1, 1);

    Kernel<<<grid, block>>>(devB, devA);

    // Copy the result back to the host.
    cudaMemcpy(B, devB, sizeof(int) * 16, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 16; i++) printf("%d\t", B[i]);

    free(A);
    free(B);
    cudaFree(devA);
    cudaFree(devB);
    return 0;
}

So my question is: in the code above, I use shared memory `int smA[16]` in the kernel, and with 2 blocks I have 2*16 = 32 threads. Since each thread executes the kernel (from Seland.pdf), do I end up with 16x16 = 256 elements of shared memory? That makes no sense!

No, your assumption is wrong. Because shared memory exists for interaction between threads within the same block, it is allocated per thread block, not per thread. In your example, each thread block uses 16 integer elements, so your kernel requires 32 integer elements in total to run both thread blocks simultaneously. It is not quite the same thing, but you can loosely compare it to static variables in C code.

If you write something like the following code example in your kernel, every thread will use its own array with 16 elements. That array cannot be accessed by other threads (with the exception of shuffle instructions).

__global__ void kernel(...)
{
    int array_single_thread[16]; // Every thread instance has its own copy of this array.
    ...
    __shared__ int array_thread_block[16]; // Allocated once for the whole thread block.
}
