
Using shared memory in CUDA without reducing threads

Looking at Mark Harris's reduction example, I am trying to see if I can have threads store intermediate values without a reduction operation:

For example, the CPU code:

for(int i = 0; i < ntr; i++)
{
    for(int j = 0; j < pos * posdir; j++)
    {
        val = x[i] * arr[j];
        if(val > 0.0)
        {
            out[xcount] = val*x[i];
            xcount += 1;
        }
    }
}

Equivalent GPU code:

const int threads = 64; 
num_blocks = ntr/threads;

__global__ void test_g(float *in1, float *in2, float *out1, int *ct, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[threads];
    __shared__ float t2[threads];

    int gcount  = 0;

    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i%posdir];
        }
       __syncthreads();

        for(int i = 0; i < 32; i++)
        {
            t2[i] = t1[i] * in1[tid];
                if(t2[i] > 0){
                    out1[gcount] = t2[i] * in1[tid];
                    gcount = gcount + 1;
                }
        }
    }        
    ct[0] = gcount;
}

What I am trying to do here is the following:

(1) Store 32 values of in2 in the shared memory variable t1,

(2) For each value of i and in1[tid], calculate t2[i],

(3) If t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount].

But my output is all wrong. I am not even able to get a count of all the times t2[i] is greater than 0.

Any suggestions on how to save the value of gcount for each i and tid? As I debug, I find that for block (0,0,0) and thread (0,0,0) I can sequentially see the values of t2 being updated. After the CUDA kernel switches focus to block (0,0,0) and thread (32,0,0), the values of out1[0] are overwritten again. How can I get/store the values of out1 for each thread and write them to the output?

I tried two approaches so far (suggested by @paseolatis on the NVIDIA forums):

(1) defined offset=tid*32; and replaced out1[gcount] with out1[offset+gcount],

(2) defined

__device__ int totgcount=0; // this line before main()
atomicAdd(&totgcount,1);
out1[totgcount]=t2[i] * in1[tid];

int *h_xc = (int*) malloc(sizeof(int) * 1);
cudaMemcpyFromSymbol(h_xc, totgcount, sizeof(int)*1, cudaMemcpyDeviceToHost);
printf("GPU: xcount = %d\n", h_xc[0]); // Output looks like this: GPU: xcount = 1928669800

Any suggestions? Thanks in advance!

OK, let's compare your description of what the code should do with what you have posted (this is sometimes called rubber duck debugging).

  1. Store 32 values of in2 in shared memory variable t1

    Your kernel contains this:

     if (threadIdx.x < 32) { t1[threadIdx.x] = in2[i%posdir]; } 

    which is effectively loading the same value from in2 into every value of t1. I suspect you want something more like this:

     if (threadIdx.x < 32) { t1[threadIdx.x] = in2[i+threadIdx.x]; } 
  2. For each value of i and in1[tid], calculate t2[i]

    This part is OK, but why is t2 needed in shared memory at all? It is only an intermediate result which can be discarded after the inner iteration is completed. You could easily have something like:

     float inval = in1[tid];
     .......
     for(int i = 0; i < 32; i++)
     {
         float result = t1[i] * inval;
         ......
  3. If t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount]

    This is where the problems really start. Here you do this:

      if(t2[i] > 0){
          out1[gcount] = t2[i] * in1[tid];
          gcount = gcount + 1;
      }

    This is a memory race. gcount is a thread-local variable, so each thread will, at different times, overwrite any given out1[gcount] with its own value. For this code to work correctly as written, gcount must be a global memory variable, and you must use atomic memory updates to ensure that each thread uses a unique value of gcount each time it outputs a value. But be warned that atomic memory access is very expensive if it is used often (this is why I asked in a comment how many output points there are per kernel launch).

The resulting kernel might look something like this:

__device__ int gcount; // must be set to zero before the kernel launch

__global__ void test_g(float *in1, float *in2, float *out1, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[32];

    float ival = in1[tid];

    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i+threadIdx.x];
        }
        __syncthreads();

        for(int j = 0; j < 32; j++)
        {
            float tval = t1[j] * ival;
            if(tval > 0){
                int idx = atomicAdd(&gcount, 1);
                out1[idx] = tval * ival;
            }
        }
    }        
}

Disclaimer: written in browser, never been compiled or tested, use at own risk.....

Note that your write to ct was also a memory race, but with gcount now a global value, you can read the value after the kernel without the need for ct.


EDIT: It seems that you are having some problems with zeroing gcount before running the kernel. To do this, you will need to use something like cudaMemcpyToSymbol, or perhaps cudaGetSymbolAddress and cudaMemset. It might look something like:

const int zero = 0;
cudaMemcpyToSymbol(gcount, &zero, sizeof(int), 0, cudaMemcpyHostToDevice); // pass the symbol itself; string symbol names are no longer accepted by current CUDA toolkits

Again, usual disclaimer: written in browser, never been compiled or tested, use at own risk.....
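
Putting the zeroing and the read-back together, a minimal host-side sketch might look like the fragment below. This is only an illustration under a few assumptions: d_in1, d_in2 and d_out1 are hypothetical device pointers you have already allocated and filled, gcount is the __device__ variable declared above, and ntr, posdir and pos are the sizes from the question.

const int threads = 64;
const int num_blocks = ntr / threads;

int zero = 0;
cudaMemcpyToSymbol(gcount, &zero, sizeof(int));       // reset the device counter

test_g<<<num_blocks, threads>>>(d_in1, d_in2, d_out1, posdir, pos);

int h_count = 0;
cudaMemcpyFromSymbol(&h_count, gcount, sizeof(int));  // read the total back; this replaces ct
printf("GPU: xcount = %d\n", h_count);

The final value of h_count then tells you how many entries of out1 were actually written.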

A better way to do what you are doing is to give each thread its own output, and let it increment its own count and write its own values - this way, the double for-loop can happen in parallel in any order, which is what the GPU does well. The output is wrong because the threads share the out1 array, so they all overwrite it.

You should also move the code that copies into shared memory into a separate loop, with a __syncthreads() after it. With the __syncthreads() out of the loop you should get better performance - this means that your shared array will have to be the size of in2 - if this is a problem, there's a better way to deal with that at the end of this answer.

You should also move the threadIdx.x < 32 check to the outside. So your code will look something like this:

if (threadIdx.x < 32) {
    for(int i = threadIdx.x; i < posdir*pos; i+=32) {
        t1[i] = in2[i];
    }
}
__syncthreads();

for(int i = threadIdx.x; i < posdir*pos; i += 32) {
    for(int j = 0; j < 32; j++)
    {
         ...
    }
}

Then put a __syncthreads(), an atomic addition of gcount += count, and a copy from the local output array to a global one - this part is sequential, and will hurt performance. If you can, I would just have a global list of pointers to the arrays for each local one, and put them together on the CPU.

Another change is that you don't need shared memory for t2 - it doesn't help you. And the way you are doing this, it seems like it works only if you are using a single block. To get good performance out of most NVIDIA GPUs, you should partition this into multiple blocks. You can tailor this to your shared memory constraint. Of course, you don't have a __syncthreads() between blocks, so the threads in each block have to go over the whole range for the inner loop, and a partition of the outer loop.
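
To make the per-thread-output idea concrete, here is a rough sketch of my own (not code from this answer, and untested). It assumes out1 has been over-allocated to ntr * max_per_thread floats, where max_per_thread is a hypothetical bound you choose on how many positive results one thread can produce, and counts is an extra array with one int per thread.

__device__ int gcount;  // total across all threads, zeroed from the host before launch

__global__ void test_g_private(float *in1, float *in2, float *out1,
                               int *counts, int posdir, int pos,
                               int max_per_thread)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[32];

    float ival = in1[tid];
    float *my_out = out1 + tid * max_per_thread;  // this thread's private slice of out1
    int count = 0;                                // private counter, so no races

    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i + threadIdx.x];
        }
        __syncthreads();

        for(int j = 0; j < 32; j++) {
            float tval = t1[j] * ival;
            if (tval > 0.0f && count < max_per_thread) {
                my_out[count++] = tval * ival;
            }
        }
        __syncthreads();   // don't overwrite t1 while other threads are still reading it
    }

    counts[tid] = count;          // per-thread count, used to compact the output later
    atomicAdd(&gcount, count);    // one atomic per thread instead of one per output value
}

On the host you can then walk counts and pack each thread's slice of out1 into a contiguous array, which keeps the sequential compaction work off the GPU.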
