CUDA: Shift arrays on shared memory

I am trying to load a flattened 2D matrix into shared memory, shift the data along x, and write it back to global memory while also shifting along y, so that the input ends up shifted along both x and y. What I have:

__global__ void test_shift(float *data_old, float *data_new)
{
    uint glob_index = threadIdx.x + blockIdx.y*blockDim.x;

    __shared__ float VAR;
    __shared__ float VAR2[NUM_THREADS];

    // load from global to shared
    VAR = data_old[glob_index];

    // do some stuff on VAR

    if (threadIdx.x < NUM_THREADS - 1)
    {
        VAR2[threadIdx.x + 1] = VAR; // shift (+1) along x
    }

    __syncthreads();

    // write to global memory
    if (threadIdx.y < ny - 1)
    {
        glob_index = threadIdx.x + (blockIdx.y + 1)*blockDim.x; // redefine glob_index to shift along y (+1)
        data_new[glob_index] = VAR2[threadIdx.x];
    }
}

The call to the kernel:

test_shift <<< grid, block >>> (data_old, data_new);

and the grid and block dimensions (blockDim.x is equal to the matrix width, i.e. 64):

dim3 block(NUM_THREADS, 1);
dim3 grid(1, ny); 

I am not able to achieve it. Could someone please point out what's wrong with this? Should I use a strided index or an offset?

VAR should not have been declared as shared: in the current form, all threads in the block scribble over each other's data when you load from global memory with VAR = data_old[glob_index];

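VAR2[threadIdx.x + 1] is also out of bounds for the last thread of the block unless the write is guarded by a check such as threadIdx.x < NUM_THREADS - 1. An unguarded access like that can make the kernel abort with an error instead of finishing (depending on the compute capability of the device; 1.x devices didn't check shared memory accesses as rigorously).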

You could have detected the latter by checking the return codes of all calls to CUDA functions for errors.
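
As a minimal sketch of such checking: CUDA_CHECK and check_last_launch below are hypothetical helper names, not part of the CUDA API; only cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString are runtime calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA API): print the error
// string and exit if a runtime call returns anything but cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: call this right after a kernel launch.
void check_last_launch()
{
    // Launch-time errors (e.g. an invalid configuration) show up here.
    CUDA_CHECK(cudaGetLastError());
    // Errors raised while the kernel runs (e.g. an illegal memory access)
    // are only reported once the host synchronizes with the device.
    CUDA_CHECK(cudaDeviceSynchronize());
}
```

In the question's code this would mean wrapping every cudaMalloc and cudaMemcpy in CUDA_CHECK(...) and calling check_last_launch() directly after test_shift <<< grid, block >>> (data_old, data_new);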

Shared variables are, well, shared by all threads in a single block. This means that you don't have blockDim.y sets of shared variables but only a single set per block.

uint glob_index = threadIdx.x + blockIdx.y*blockDim.x;

__shared__ float VAR;
__shared__ float VAR2[NUM_THREADS];
VAR = data_old[glob_index];

if (threadIdx.x < NUM_THREADS - 1)
{
  VAR2[threadIdx.x + 1] = VAR; // shift (+1) along x
}

This instructs all threads in a block to write their data into a single variable (VAR). There is no synchronization before that variable is read in the second assignment, so the result is undefined: threads from the first warp may already be reading VAR while threads from the second warp are still writing to it. You should make VAR a local variable, or create a shared memory array with one element per thread in the block.

if (threadIdx.y < ny - 1)
{
  glob_index = threadIdx.x + (blockIdx.y + 1)*blockDim.x; 
  data_new[glob_index] = VAR2[threadIdx.x];
}

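VAR2[0] still contains garbage, because you never write to it. Also, threadIdx.y is always zero in your blocks (each block is NUM_THREADS x 1), so the test threadIdx.y < ny - 1 gives the same result for every thread and does not limit the shift along y. The row is selected by blockIdx.y, and the last block (blockIdx.y == ny - 1) writes one row past the end of an ny-row data_new.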

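And avoid using uints. They have (or used to have) some performance problems: unsigned arithmetic has defined wraparound semantics, which can keep the compiler from optimizing index calculations as aggressively as it does with signed ints.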
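
Putting the points above together, a shared-memory version could look roughly like the sketch below. It keeps the question's launch configuration (one block of NUM_THREADS threads per row) and assumes NUM_THREADS is 64, equal to the matrix width. Filling the vacated first row and first column with zeros is my assumption, and test_shift_shared and shift_shared_matches_reference are hypothetical names used only for this illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define NUM_THREADS 64  // assumption: block width == matrix width, as in the question

__global__ void test_shift_shared(const float *data_old, float *data_new)
{
    __shared__ float row[NUM_THREADS];   // this block's row, shifted along x

    int x = threadIdx.x;                 // column
    int y = blockIdx.y;                  // row (one block per row)

    float v = data_old[x + y * blockDim.x];  // private per-thread value

    // ... do some stuff on v ...

    if (x < NUM_THREADS - 1)
        row[x + 1] = v;                  // shift (+1) along x
    if (x == 0)
        row[0] = 0.0f;                   // slot that nothing is shifted into

    __syncthreads();                     // all of row[] is written past this point

    if (y < gridDim.y - 1)
        data_new[x + (y + 1) * blockDim.x] = row[x];  // shift (+1) along y
    if (y == 0)
        data_new[x] = 0.0f;              // first output row receives no data
}

// Hypothetical host-side check: run the kernel on an ny x NUM_THREADS matrix
// and compare against a plain CPU loop.
bool shift_shared_matches_reference(int ny)
{
    const int nx = NUM_THREADS;
    const size_t bytes = size_t(nx) * ny * sizeof(float);
    std::vector<float> h_old(nx * ny), h_new(nx * ny, -1.0f);
    for (int i = 0; i < nx * ny; ++i) h_old[i] = float(i + 1);

    float *d_old = nullptr, *d_new = nullptr;
    if (cudaMalloc(&d_old, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_new, bytes) != cudaSuccess) { cudaFree(d_old); return false; }
    cudaMemcpy(d_old, h_old.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(NUM_THREADS, 1);
    dim3 grid(1, ny);
    test_shift_shared<<<grid, block>>>(d_old, d_new);

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    ok = ok && (cudaMemcpy(h_new.data(), d_new, bytes,
                           cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(d_old);
    cudaFree(d_new);

    for (int y = 0; ok && y < ny; ++y)
        for (int x = 0; x < nx; ++x) {
            // new(x, y) = old(x - 1, y - 1), zero where nothing is shifted in
            float expect = (x > 0 && y > 0) ? h_old[(x - 1) + (y - 1) * nx] : 0.0f;
            if (h_new[x + y * nx] != expect) ok = false;
        }
    return ok;
}
```

Each output row is written by exactly one block and each shared slot by exactly one thread, so the only synchronization needed is the single __syncthreads() between filling row[] and reading it.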

Actually, for such a simple task you don't need to use shared memory at all:

__global__ void test_shift(float *data_old, float *data_new)
{
    int glob_index = threadIdx.x + blockIdx.y*blockDim.x;

    // load from global to local (one private copy per thread)
    float VAR = data_old[glob_index];

    int glob_index_new;
    // calculate only if we are going to output something
    if ( (blockIdx.y < gridDim.y - 1) && (threadIdx.x < blockDim.x - 1) )
    {
        // interior element: shift (+1) along x and (+1) along y
        glob_index_new = threadIdx.x + 1 + (blockIdx.y + 1)*blockDim.x;

        // do some stuff on VAR
    }
    else // just write 0.0 to remove garbage
    {
        // threads of the last row zero the first output row;
        // last-column threads of the other rows zero the first output column
        glob_index_new = (blockIdx.y == gridDim.y - 1) ? threadIdx.x
                                                       : (blockIdx.y + 1)*blockDim.x;
        VAR = 0.0f;
    }

    // write to global memory
    data_new[glob_index_new] = VAR;
}
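
The same shift can also be written as a gather, where each thread owns one output element and decides whether to read a source element or write zero. This is only a sketch of an alternative formulation, not the kernel above: the names shift_gather and shift_gather_matches_reference, the extra nx/ny parameters and the 16x16 block size are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Gather form: new(x, y) = old(x - 1, y - 1), zero where no source exists.
__global__ void shift_gather(const float *data_old, float *data_new,
                             int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;      // threads outside the matrix do nothing

    float v = 0.0f;                      // first row and first column get zero
    if (x > 0 && y > 0)
        v = data_old[(x - 1) + (y - 1) * nx];

    data_new[x + y * nx] = v;            // every output element written exactly once
}

// Hypothetical host-side check against a plain CPU loop.
bool shift_gather_matches_reference(int nx, int ny)
{
    const size_t bytes = size_t(nx) * ny * sizeof(float);
    std::vector<float> h_old(nx * ny), h_new(nx * ny, -1.0f);
    for (int i = 0; i < nx * ny; ++i) h_old[i] = float(i + 1);

    float *d_old = nullptr, *d_new = nullptr;
    if (cudaMalloc(&d_old, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_new, bytes) != cudaSuccess) { cudaFree(d_old); return false; }
    cudaMemcpy(d_old, h_old.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                  // assumed block shape
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    shift_gather<<<grid, block>>>(d_old, d_new, nx, ny);

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    ok = ok && (cudaMemcpy(h_new.data(), d_new, bytes,
                           cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(d_old);
    cudaFree(d_new);

    for (int y = 0; ok && y < ny; ++y)
        for (int x = 0; x < nx; ++x) {
            float expect = (x > 0 && y > 0) ? h_old[(x - 1) + (y - 1) * nx] : 0.0f;
            if (h_new[x + y * nx] != expect) ok = false;
        }
    return ok;
}
```

Because the index arithmetic no longer depends on blockDim.x being equal to the matrix width, this form should also work for widths that are not a multiple of the block size.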
