
Shared memory in CUDA: how does this code work?

I have a program to compute the values of an array. Array A has 32 elements, with values from 0 to 31. Array B has 16 elements, all initialized to 0.

**I want to compute the value of B[i] following this rule: B[i] = A[i*2] + A[i*2+1], for i from 0 to 15.** I use CUDA in my example code:

Main.cu

 __global__ void Kernel(int *devB, int *devA)
    {  
    // Use shared memory: 16 threads per block, so 16 shared-memory elements per block
    __shared__ int smA[16];
    // Copy data from global memory to shared memory;
    // one thread copies one element
      smA[threadIdx.x] = devA[threadIdx.x + blockIdx.x * blockDim.x];
      __syncthreads(); 
    // Only the first 8 threads of each block write a result,
    // so each block produces blockDim.x / 2 = 8 outputs
      if (threadIdx.x < 8)
    {
      devB[threadIdx.x + blockIdx.x * (blockDim.x / 2)] =
      smA[threadIdx.x * 2] + smA[threadIdx.x * 2 + 1];
    }
}

The main function

int main()
{
    int *A = (int*)malloc(sizeof(int) * 32);
    int *B = (int*)malloc(sizeof(int) * 16);

    for (int i = 0; i < 32; i++)
        A[i] = i;

    int *devA = NULL;
    cudaMalloc((void**)&devA, sizeof(int) * 32);
    cudaMemcpy(devA, A, sizeof(int) * 32, cudaMemcpyHostToDevice);
    int *devB = NULL;
    cudaMalloc((void**)&devB, sizeof(int) * 16);

    dim3 block(16, 1, 1);
    dim3 grid(2, 1, 1);

    Kernel<<<grid, block>>>(devB, devA);

    // Copy the result back to the host
    cudaMemcpy(B, devB, sizeof(int) * 16, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 16; i++) printf("%d\t", B[i]);

    if (A != NULL) free(A);
    if (B != NULL) free(B);
    if (devA != NULL) cudaFree(devA);
    if (devB != NULL) cudaFree(devB);
    return 0;
}

So, my question is: following my code above, I use shared memory `int smA[16]` in the kernel, with 2 blocks = 2*16 threads. Because each thread executes the kernel (from Seland.pdf), will I have 16x16 = 256 elements in shared memory? That makes no sense!

No, your assumption is wrong. Because shared memory is used for interaction between threads within the same block, it is allocated once per thread block. In your example every thread block uses 16 integer elements, so in total your kernel requires 32 integer elements to run all thread blocks simultaneously. It is not quite the same thing, but you can loosely compare it with static variables in C code.

If you write something like the following code example in your kernel, every thread will use its own array with 16 elements. But that array cannot be accessed by other threads (the exception being shuffle instructions).

__global__ void kernel (...)
{
  int array_single_thread[16]; // Every thread instance has its own array.
  ...
  __shared__ int array_thread_block[16]; // Allocated once per thread block.
}
