CUDA global, shared, and constant memory

Question

I just started learning CUDA, and I'm having trouble converting some code to use shared memory and another version to use constant memory, for comparison purposes.

// Image_Size is assumed to be a constant defined elsewhere in the program.
__global__ void CUDA(int *device_array_Image1, int *device_array_Image2, int *device_array_Image3,
                     int *device_array_kernel,
                     int *device_array_Result1, int *device_array_Result2, int *device_array_Result3)
{
    int i = blockIdx.x;   // row: one block per image row
    int j = threadIdx.x;  // column: one thread per pixel in the row

    // Skip border pixels; otherwise (i + N) or (j + M) would index outside the image.
    if (i < 1 || i >= Image_Size - 1 || j < 1 || j >= Image_Size - 1)
        return;

    int ArraySum1 = 0;  // running sums of the 3x3 neighbourhood, one per image
    int ArraySum2 = 0;
    int ArraySum3 = 0;
    for (int N = -1; N <= 1; N++)
    {
        for (int M = -1; M <= 1; M++)
        {
            int pixel  = (i + N) * Image_Size + (j + M);  // neighbour index in the image
            int weight = device_array_kernel[(N + 1) * 3 + (M + 1)];  // matching 3x3 coefficient
            ArraySum1 += device_array_Image1[pixel] * weight;
            ArraySum2 += device_array_Image2[pixel] * weight;
            ArraySum3 += device_array_Image3[pixel] * weight;
        }
    }

    device_array_Result1[i * Image_Size + j] = ArraySum1;
    device_array_Result2[i * Image_Size + j] = ArraySum2;
    device_array_Result3[i * Image_Size + j] = ArraySum3;
}

This is what I have so far, but I'm having trouble understanding shared and constant memory, so if anyone could help with the code or point me in the right direction I'd be really grateful.

Thanks for any help.

Answer 1

a) Shared memory: This memory is visible only to the threads of a single block. It is useful when the threads of a block access the same data more than once. It does not help when squaring a number, where each input is read only once, but it does help in matrix multiplication, where each element is reused many times.
b) Constant memory: The data is stored in device memory and is read through a per-multiprocessor constant cache. The device provides 64 KB of constant memory in total, and each multiprocessor has a small constant cache (about 8 KB on many architectures). A read is broadcast to the threads of a warp, so if all threads in the warp request the same address, the value is delivered to all of them in a single read. Reads to different addresses within a warp are serialized.
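As an illustration of point (b), here is a minimal sketch of the 3x3 filter from the question with the coefficients moved into constant memory. It is not the asker's final code: `IMAGE_SIZE`, `c_kernel`, and `convolveConstant` are hypothetical names, the image is assumed to be square with one block per row and one thread per column, only one of the three images is processed, and border pixels are simply skipped. Every thread reads the same coefficient in each loop iteration, which is the access pattern the constant cache broadcasts well.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define IMAGE_SIZE 64  // assumed square image size (hypothetical value)

// 3x3 filter coefficients in constant memory (hypothetical name).
// All threads of a warp read the same element at the same time.
__constant__ int c_kernel[9];

__global__ void convolveConstant(const int *image, int *result)
{
    int row = blockIdx.x;   // one block per row, as in the question
    int col = threadIdx.x;  // one thread per column

    // Skip the border so (row + n, col + m) never leaves the image.
    if (row < 1 || row >= IMAGE_SIZE - 1 || col < 1 || col >= IMAGE_SIZE - 1)
        return;

    int sum = 0;
    for (int n = -1; n <= 1; ++n)
        for (int m = -1; m <= 1; ++m)
            sum += image[(row + n) * IMAGE_SIZE + (col + m)]
                 * c_kernel[(n + 1) * 3 + (m + 1)];

    result[row * IMAGE_SIZE + col] = sum;
}

int main()
{
    const int count = IMAGE_SIZE * IMAGE_SIZE;
    const size_t bytes = count * sizeof(int);
    static int h_image[IMAGE_SIZE * IMAGE_SIZE];
    static int h_result[IMAGE_SIZE * IMAGE_SIZE];
    const int h_kernel[9] = {1, 1, 1, 1, 1, 1, 1, 1, 1};  // box filter, demo data only

    for (int i = 0; i < count; ++i)
        h_image[i] = 1;  // demo input: an image of all ones

    int *d_image = nullptr, *d_result = nullptr;
    cudaMalloc(&d_image, bytes);
    cudaMalloc(&d_result, bytes);
    cudaMemcpy(d_image, h_image, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, bytes);  // border pixels stay zero

    // Constant memory is filled from the host with cudaMemcpyToSymbol,
    // not passed to the kernel as a pointer argument.
    cudaMemcpyToSymbol(c_kernel, h_kernel, sizeof(h_kernel));

    convolveConstant<<<IMAGE_SIZE, IMAGE_SIZE>>>(d_image, d_result);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_result, d_result, bytes, cudaMemcpyDeviceToHost);
    printf("result[1][1] = %d\n", h_result[1 * IMAGE_SIZE + 1]);

    cudaFree(d_image);
    cudaFree(d_result);
    return 0;
}
```

For the three images in the question, the same kernel could be launched once per image, or extended with the extra pointers. The coefficients only need to be copied to the symbol once.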

The links below helped me understand constant and shared memory:

1) http://cuda-programming.blogspot.in/2013/01/what-is-constant-memory-in-cuda.html
2) http://cuda-programming.blogspot.in/2013/01/shared-memory-and-synchronization-in.html
3) https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/

Please refer to these links.
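To make the shared-memory point in (a) concrete, here is a minimal sketch of a tiled version of the 3x3 filter. Each block first copies its tile of the image, plus a one-pixel halo, into shared memory, so every pixel is fetched from global memory once per block instead of up to nine times. The names `IMAGE_SIZE`, `TILE`, and `convolveShared` are hypothetical; the sketch assumes a square image whose size is a multiple of the tile size, processes a single image, and treats pixels outside the image as zero, which is a choice made here and not something the question specifies.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define IMAGE_SIZE 64  // assumed square image size, a multiple of TILE
#define TILE 16        // assumed tile width and height (threads per block side)

__global__ void convolveShared(const int *image, const int *kernel, int *result)
{
    // Tile plus a one-pixel halo on every side.
    __shared__ int tile[TILE + 2][TILE + 2];

    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;

    // Cooperative load: TILE*TILE threads fill (TILE+2)*(TILE+2) cells,
    // so some threads load more than one cell.
    for (int ty = threadIdx.y; ty < TILE + 2; ty += TILE) {
        for (int tx = threadIdx.x; tx < TILE + 2; tx += TILE) {
            int r = blockIdx.y * TILE + ty - 1;
            int c = blockIdx.x * TILE + tx - 1;
            bool inside = r >= 0 && r < IMAGE_SIZE && c >= 0 && c < IMAGE_SIZE;
            tile[ty][tx] = inside ? image[r * IMAGE_SIZE + c] : 0;  // zero padding
        }
    }
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    int sum = 0;
    for (int n = -1; n <= 1; ++n)
        for (int m = -1; m <= 1; ++m)
            sum += tile[threadIdx.y + 1 + n][threadIdx.x + 1 + m]
                 * kernel[(n + 1) * 3 + (m + 1)];

    result[row * IMAGE_SIZE + col] = sum;
}

int main()
{
    const int count = IMAGE_SIZE * IMAGE_SIZE;
    const size_t bytes = count * sizeof(int);
    static int h_image[IMAGE_SIZE * IMAGE_SIZE];
    static int h_result[IMAGE_SIZE * IMAGE_SIZE];
    const int h_kernel[9] = {1, 1, 1, 1, 1, 1, 1, 1, 1};  // box filter, demo data only

    for (int i = 0; i < count; ++i)
        h_image[i] = 1;  // demo input: an image of all ones

    int *d_image = nullptr, *d_kernel = nullptr, *d_result = nullptr;
    cudaMalloc(&d_image, bytes);
    cudaMalloc(&d_kernel, sizeof(h_kernel));
    cudaMalloc(&d_result, bytes);
    cudaMemcpy(d_image, h_image, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_kernel, h_kernel, sizeof(h_kernel), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(IMAGE_SIZE / TILE, IMAGE_SIZE / TILE);
    convolveShared<<<grid, block>>>(d_image, d_kernel, d_result);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_result, d_result, bytes, cudaMemcpyDeviceToHost);
    printf("result[1][1] = %d\n", h_result[1 * IMAGE_SIZE + 1]);

    cudaFree(d_image);
    cudaFree(d_kernel);
    cudaFree(d_result);
    return 0;
}
```

The `__syncthreads()` call is the essential part: without it, a thread could read a halo cell that a neighbouring thread has not written yet. Whether the tiled version is actually faster for a filter this small depends on the hardware caches, so it is worth timing both variants.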

Answer 1: 4 votes, accepted, 2016-04-20 17:49:26