
Using shared memory in CUDA without reducing threads

Looking at Mark Harris's reduction example, I am trying to see if I can have threads store intermediate values without a reduction operation:

For example CPU code:

int xcount = 0;
float val;

for(int i = 0; i < ntr; i++)
{
    for(int j = 0; j < pos* posdir; j++)
    {
        val = x[i] * arr[j];
        if(val > 0.0)
        {
            out[xcount] = val*x[i];
            xcount += 1;
        }
    }
}

Equivalent GPU code:

const int threads = 64;
const int num_blocks = ntr/threads;

__global__ void test_g(float *in1, float *in2, float *out1, int *ct, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[threads];
    __shared__ float t2[threads];

    int gcount  = 0;

    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i%posdir];
        }
        __syncthreads();

        for(int i = 0; i < 32; i++)
        {
            t2[i] = t1[i] * in1[tid];
            if(t2[i] > 0){
                out1[gcount] = t2[i] * in1[tid];
                gcount = gcount + 1;
            }
        }
    }
    ct[0] = gcount;
}

What I am trying to do here is the following:

(1) Store 32 values of in2 in the shared memory variable t1,

(2) For each value of i and in1[tid], calculate t2[i],

(3) If t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount].

But my output is all wrong. I am not even able to get a count of all the times t2[i] is greater than 0.

Any suggestions on how to save the value of gcount for each i and tid? While debugging, I find that for block (0,0,0) and thread (0,0,0) I can see the values of t2 updated sequentially. After the CUDA kernel switches focus to block (0,0,0) and thread (32,0,0), the values of out1[0] are overwritten again. How can I store the values of out1 produced by each thread and write them to the output?

So far I have tried two approaches (suggested by @paseolatis on the NVIDIA forums):

(1) defined offset = tid*32; and replaced out1[gcount] with out1[offset+gcount] (see the sketch after this list),

(2) defined

__device__ int totgcount=0; // this line before main()
atomicAdd(&totgcount,1);
out1[totgcount]=t2[i] * in1[tid];

int *h_xc = (int*) malloc(sizeof(int) * 1);
cudaMemcpyFromSymbol(h_xc, totgcount, sizeof(int) * 1, 0, cudaMemcpyDeviceToHost);
printf("GPU: xcount = %d\n", h_xc[0]); // Output looks like this: GPU: xcount = 1928669800

Any suggestions? Thanks in advance!

OK, let's compare your description of what the code should do with what you have posted (this is sometimes called rubber duck debugging).

  1. Store 32 values of in2 in shared memory variable t1

    Your kernel contains this:

     if (threadIdx.x < 32) {
         t1[threadIdx.x] = in2[i%posdir];
     }

    which loads the same value from in2 into every element of t1. I suspect you want something more like this:

     if (threadIdx.x < 32) {
         t1[threadIdx.x] = in2[i+threadIdx.x];
     }
  2. For each value of i and in1[tid], calculate t2[i],

    This part is OK, but why is t2 needed in shared memory at all? It is only an intermediate result which can be discarded after the inner iteration is completed. You could easily have something like:

     float inval = in1[tid];
     .......
     for(int i = 0; i < 32; i++)
     {
         float result = t1[i] * inval;
         ......
  3. If t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount]

    This is where the problems really start. Here you do this:

      if(t2[i] > 0){
          out1[gcount] = t2[i] * in1[tid];
          gcount = gcount + 1;
      }

    This is a memory race. gcount is a thread-local variable, so each thread will, at different times, overwrite any given out1[gcount] with its own value. For this code to work correctly as written, gcount must be a global memory variable, updated with atomic operations so that each thread obtains a unique value of gcount every time it outputs a value. But be warned that atomic memory access is very expensive if used often (this is why I asked in a comment how many output points there are per kernel launch).

The resulting kernel might look something like this:

__device__ int gcount; // must be set to zero before the kernel launch

__global__ void test_g(float *in1, float *in2, float *out1, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[32];

    float ival = in1[tid];

    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i+threadIdx.x];
        }
        __syncthreads();

        for(int j = 0; j < 32; j++)
        {
            float tval = t1[j] * ival;
            if(tval > 0){
                int idx = atomicAdd(&gcount, 1);
                out1[idx] = tval * ival;
            }
        }
        __syncthreads(); // make sure all threads are done with t1 before it is reloaded
    }
}

Disclaimer: written in browser, never been compiled or tested, use at own risk.....

Note that your write to ct was also a memory race, but with gcount now a global value, you can read the value after the kernel without the need for ct .
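
For example, reading the count back on the host might look something like this (an untested sketch; h_gcount is a name I've introduced, and gcount is the __device__ variable declared above):

int h_gcount = 0;
cudaMemcpyFromSymbol(&h_gcount, gcount, sizeof(int), 0, cudaMemcpyDeviceToHost);
printf("GPU: xcount = %d\n", h_gcount);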


EDIT: It seems that you are having some problems with zeroing gcount before running the kernel. To do this, you will need to use something like cudaMemcpyToSymbol or perhaps cudaGetSymbolAddress and cudaMemset . It might look something like:

const int zero = 0;
cudaMemcpyToSymbol(gcount, &zero, sizeof(int), 0, cudaMemcpyHostToDevice);

Again, usual disclaimer: written in browser, never been compiled or tested, use at own risk.....
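
The cudaGetSymbolAddress and cudaMemset alternative mentioned above might look something like this (same caveats; d_gcount is a name I've introduced):

int *d_gcount = NULL;
cudaGetSymbolAddress((void **)&d_gcount, gcount);
cudaMemset(d_gcount, 0, sizeof(int));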

A better way to do what you are doing is to give each thread its own output buffer, and let it increment its own count and write values there - this way, the double for loop can happen in parallel, in any order, which is what the GPU does well. The output is wrong because the threads share the out1 array, so they all overwrite it.

You should also move the code that copies into shared memory into a separate loop, with a __syncthreads() after it. With the __syncthreads() out of the main loop you should get better performance - this means that your shared array will have to be the size of in2; if that is a problem, there is a better way to deal with it at the end of this answer.

You should also move the threadIdx.x < 32 check outside the loops. Your code will then look something like this:

if (threadIdx.x < 32) {
    for(int i = threadIdx.x; i < posdir*pos; i+=32) {
        t1[i] = in2[i];
    }
}
__syncthreads();

for(int i = threadIdx.x; i < posdir*pos; i += 32) {
    for(int j = 0; j < 32; j++)
    {
         ...
    }
}

Then put a __syncthreads(), an atomic addition gcount += count, and a copy from the local output array to a global one - this part is sequential and will hurt performance. If you can, I would instead keep a global list of pointers to each block's local array and assemble them on the CPU.
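
As a hypothetical sketch of that final stage (a variant of the idea above; the buffer size, kernel name, and one-output-per-thread assumption are mine): each block stages its results in shared memory, reserves a contiguous slice of the global output with a single atomicAdd, and then copies its buffer out in parallel:

__device__ int gcount;   // zeroed before the kernel launch

__global__ void collect(float *in1, float *in2, float *out1, int n)
{
    __shared__ float buf[256]; // per-block staging buffer; assumes blockDim.x <= 256
                               // and at most one output per thread
    __shared__ int count;      // outputs produced by this block
    __shared__ int base;       // this block's reserved slice of out1

    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        float val = in1[tid] * in2[tid];
        if (val > 0.0f) {
            int idx = atomicAdd(&count, 1);   // cheap shared-memory atomic
            buf[idx] = val * in1[tid];
        }
    }
    __syncthreads();

    if (threadIdx.x == 0)
        base = atomicAdd(&gcount, count);     // one global atomic per block
    __syncthreads();

    for (int i = threadIdx.x; i < count; i += blockDim.x)
        out1[base + i] = buf[i];              // parallel copy into the reserved slice
}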

Another change is that you don't need shared memory for t2 - it doesn't help you. Also, the way you are doing this, it seems to work only if you are using a single block. To get good performance out of most NVIDIA GPUs, you should partition the work across multiple blocks, tailored to your shared memory constraint. Of course, there is no __syncthreads() between blocks, so the threads in each block have to cover the whole range of the inner loop and a partition of the outer loop.
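
A skeleton of that partitioning (hypothetical names; the per-value handling is elided) might look like:

__global__ void test_partitioned(const float *in1, const float *in2,
                                 float *out1, int ntr, int posdir, int pos)
{
    // Each block owns a contiguous chunk of the outer range [0, ntr)
    int chunk   = (ntr + gridDim.x - 1) / gridDim.x;
    int i_begin = blockIdx.x * chunk;
    int i_end   = min(i_begin + chunk, ntr);

    // Threads stride over their block's chunk; each sweeps the full inner range
    for (int i = i_begin + threadIdx.x; i < i_end; i += blockDim.x) {
        float xi = in1[i];
        for (int j = 0; j < posdir * pos; j++) {
            float val = xi * in2[j];
            // ... handle val > 0.0f as in the kernels above ...
        }
    }
}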
