
Writing to Shared Memory in CUDA without the use of a kernel

I want to create an array in my main() function, fill it with all the proper values, and then have that array immediately usable by the threads in shared memory.

Every example I've looked up for how to use shared memory in CUDA has the threads doing the writing into the shared array, but I want my shared array to be populated before the kernel is launched.

Any help doing this would be much appreciated. Thanks in advance!

Some context: The shared array I want never changes and is read from by all threads.

Edit: apparently this is not possible with shared memory. Does anyone know if it would be possible with the read-only cache?

It's not possible. The only way to populate shared memory is by using threads in CUDA kernels.

If you want a set of (read-only) data to be available to a kernel at launch, it's certainly possible to use __constant__ memory. Such memory can be set up by host code using the API indicated in the documentation, i.e. cudaMemcpyToSymbol.

__constant__ memory is really only useful when every thread accesses the same location in a given access cycle, e.g.:

int myval = constant_data[12];

Otherwise, use ordinary global memory, either statically or dynamically allocated, using the appropriate host API to initialize it (dynamic: cudaMemcpy; static: cudaMemcpyToSymbol).
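
To make that concrete, here is a minimal sketch of both approaches. The names const_table, global_table, dyn_table and the size 256 are my own illustrative choices, not anything from the original post:

#include <cuda_runtime.h>

#define TABLE_SIZE 256

__constant__ float const_table[TABLE_SIZE];  // static, read-only from device code
__device__   float global_table[TABLE_SIZE]; // static global memory

int main() {
    float host_table[TABLE_SIZE];
    for (int i = 0; i < TABLE_SIZE; i++) host_table[i] = (float)i; // placeholder values

    // static (__constant__ or __device__) memory: cudaMemcpyToSymbol
    cudaMemcpyToSymbol(const_table,  host_table, sizeof(host_table));
    cudaMemcpyToSymbol(global_table, host_table, sizeof(host_table));

    // dynamically allocated global memory: cudaMalloc + cudaMemcpy
    float *dyn_table;
    cudaMalloc(&dyn_table, sizeof(host_table));
    cudaMemcpy(dyn_table, host_table, sizeof(host_table), cudaMemcpyHostToDevice);

    // ... launch kernels that read const_table, global_table, or dyn_table ...

    cudaFree(dyn_table);
    return 0;
}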

While the specific behavior you have requested is not possible automatically, this is actually a fairly common CUDA paradigm:

1. Have all the threads in the block copy the table into shared memory.

2. Synchronize the threads.

3. Access the data in your kernel.

This can be a large performance gain if you have fairly random accesses to the data, and if you expect to touch each entry on average more than a few times. Essentially you are using shmem as a managed cache, and aggregating the load from DRAM into shmem once for use many times. Further, shmem has no penalty for uncoalesced loads.

For example, you could code it something like this:

const int buffer_size = 8192; // assume an 8K-entry buffer (8192 floats = 32 KB of shared memory)
float *device_buffer; // assume this already points to a buffer on the device with the data you want.

// the third launch parameter is the dynamic shared memory size, in bytes
my_kernel<<<num_blocks, num_threads, buffer_size * sizeof(float)>>>(..., buffer_size, device_buffer);

__global__ void my_kernel(..., int buffer_size, const float *device_buffer) {
   extern __shared__ float shmem_buffer[];

   // all threads in the block cooperate to copy the table into shared memory
   for (int idx = threadIdx.x; idx < buffer_size; idx += blockDim.x) {
       shmem_buffer[idx] = device_buffer[idx];
   }
   __syncthreads(); // make the copy visible to every thread in the block

   // rest of your kernel goes here.  You can access data in shmem_buffer.
}

In other words, you just have to code the copy explicitly. Since all loads from DRAM will be perfectly coalesced, this should be close to optimally efficient.
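
To tie this back to the question: the device_buffer assumed above can come from an ordinary array created in main(). Here is a self-contained sketch; the name host_data, the launch configuration of 4 blocks of 256 threads, and the simplified kernel signature (without the "...") are illustrative assumptions, not part of the answer above:

#include <cuda_runtime.h>

// the kernel from the answer above, with the signature simplified for this sketch
__global__ void my_kernel(int buffer_size, const float *device_buffer) {
    extern __shared__ float shmem_buffer[];
    for (int idx = threadIdx.x; idx < buffer_size; idx += blockDim.x)
        shmem_buffer[idx] = device_buffer[idx];
    __syncthreads();
    // ... use shmem_buffer ...
}

int main() {
    const int N = 8192;
    static float host_data[N];                           // the array created in main()
    for (int i = 0; i < N; i++) host_data[i] = (float)i; // fill in the proper values

    float *device_buffer;
    cudaMalloc(&device_buffer, N * sizeof(float));
    cudaMemcpy(device_buffer, host_data, N * sizeof(float), cudaMemcpyHostToDevice);

    // each block copies the table into its own shared memory inside the kernel
    my_kernel<<<4, 256, N * sizeof(float)>>>(N, device_buffer);
    cudaDeviceSynchronize();

    cudaFree(device_buffer);
    return 0;
}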
