How can I use shared memory here in my CUDA kernel?

I have the following CUDA kernel:

__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, double investment, double profitability) {
    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x)
    {
        strategies[i].backtest(data, investment, profitability);
    }
}

TL;DR: I would like to find a way to store data in shared (__shared__) memory. What I don't understand is how to fill the shared variable using multiple threads.

I have seen examples where data is copied to shared memory thread by thread (e.g. myblock[tid] = data[tid]), but I'm not sure how to do this in my situation. The issue is that each thread needs access to an entire (flattened) "row" of data on each iteration through the data set (see the kernel call further below).

I'm hoping for something like this:

__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, int propertyCount, double investment, double profitability) {
    __shared__ double sharedData[propertyCount];

    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x)
    {
        strategies[i].backtest(sharedData, investment, profitability);
    }
}

Here are more details (if more information is needed, please ask!):

strategies is a pointer to a list of Strategy objects, and data is a pointer to an allocated flattened data array.

In backtest() I access data like so:

data[0]
data[1]
data[2]
...

Unflattened, data is a fixed size 2D array similar to this:

[[87.6, 85.4, 88.2, 86.1],
 [84.1, 86.5, 86.7, 85.9],
 [86.7, 86.5, 86.2, 86.1],
 ...]

As for the kernel call, I iterate over the data items and call it n times for n data rows (about 3.5 million):

int dataCount = 3500000;
int propertyCount = 4;

for (int i = 0; i < dataCount; i++) {
    // Compute the offset in size_t so it cannot overflow for wide rows.
    size_t dataPointerOffset = (size_t)i * propertyCount;

    // Notice pointer arithmetic.
    optimizer_backtest<<<32, 1024>>>(devData + dataPointerOffset, devStrategies, strategyCount, investment, profitability);
}

As confirmed in your comment, you want to apply 20k strategies (this number is from your previous question) to every one of the 3.5m data rows and examine the 20k x 3.5m results.

Without shared memory, you have to read all the data 20k times, or all the strategies 3.5m times, from global memory.

Shared memory can speed up your program by reducing global memory accesses. Say you read 1k strategies and 1k data rows into shared memory each time, examine the 1k x 1k results, and repeat until all combinations have been examined. This way you reduce global memory traffic to reading all the data 20 times and all the strategies 3.5k times. The situation is similar to a vector-vector outer product, where every element of one vector is paired with every element of the other. You can find reference code for that pattern for more detail.
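The tiling idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the asker's actual code: the Strategy struct is a toy stand-in (the real class is not shown in the question), PROPERTY_COUNT and ROWS_PER_TILE are made-up constants, and only the data is tiled while each thread keeps working on its own strategy. The block loads a tile of rows cooperatively, synchronizes, and then every thread runs its strategy against all rows in the tile, so one kernel launch replaces the per-row launches.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the asker's Strategy class (its real fields are not shown).
struct Strategy {
    double threshold;  // toy parameter
    int hits;          // toy result: rows whose first property exceeds threshold

    __device__ void backtest(const double *row) {
        if (row[0] > threshold) hits++;
    }
};

constexpr int PROPERTY_COUNT = 4;    // assumed row width, as in the question's example
constexpr int ROWS_PER_TILE  = 256;  // assumed tile height: 256 * 4 * 8 B = 8 KB

__global__ void backtest_tiled(const double *data, int rowCount,
                               Strategy *strategies, int strategyCount) {
    __shared__ double tile[ROWS_PER_TILE * PROPERTY_COUNT];

    // Every thread of the block walks the same tiles, so __syncthreads() is safe here.
    for (int tileStart = 0; tileStart < rowCount; tileStart += ROWS_PER_TILE) {
        int rowsInTile = min(ROWS_PER_TILE, rowCount - tileStart);
        int valuesInTile = rowsInTile * PROPERTY_COUNT;
        const double *src = data + (size_t)tileStart * PROPERTY_COUNT;

        // Cooperative load: thread t copies elements t, t + blockDim.x, ...
        for (int j = threadIdx.x; j < valuesInTile; j += blockDim.x) {
            tile[j] = src[j];
        }
        __syncthreads();  // the tile is complete before anyone reads it

        // Grid-stride loop over strategies; each one sees every row of the tile.
        for (int s = blockIdx.x * blockDim.x + threadIdx.x; s < strategyCount;
             s += blockDim.x * gridDim.x) {
            for (int r = 0; r < rowsInTile; ++r) {
                strategies[s].backtest(&tile[r * PROPERTY_COUNT]);
            }
        }
        __syncthreads();  // everyone is done before the tile is overwritten
    }
}

// Host-side check on the three sample rows from the question.
bool run_tiled_demo() {
    const double hostData[3 * PROPERTY_COUNT] = {87.6, 85.4, 88.2, 86.1,
                                                 84.1, 86.5, 86.7, 85.9,
                                                 86.7, 86.5, 86.2, 86.1};
    Strategy hostStrategies[2] = {{85.0, 0}, {87.0, 0}};
    double *devData = nullptr;
    Strategy *devStrategies = nullptr;
    if (cudaMalloc(&devData, sizeof(hostData)) != cudaSuccess) return false;
    if (cudaMalloc(&devStrategies, sizeof(hostStrategies)) != cudaSuccess) {
        cudaFree(devData);
        return false;
    }
    cudaMemcpy(devData, hostData, sizeof(hostData), cudaMemcpyHostToDevice);
    cudaMemcpy(devStrategies, hostStrategies, sizeof(hostStrategies),
               cudaMemcpyHostToDevice);

    backtest_tiled<<<2, 64>>>(devData, 3, devStrategies, 2);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;

    cudaMemcpy(hostStrategies, devStrategies, sizeof(hostStrategies),
               cudaMemcpyDeviceToHost);
    cudaFree(devData);
    cudaFree(devStrategies);
    // First-column values are 87.6, 84.1, 86.7: two exceed 85.0, one exceeds 87.0.
    return ok && hostStrategies[0].hits == 2 && hostStrategies[1].hits == 1;
}
```

In this sketch each block streams through all the data once, so the data is read from global memory once per block rather than once per strategy. A real version would probably also keep each thread's strategy state in registers instead of updating global memory on every row.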

However, each of your data rows is large (an 838-D vector), and the strategies may be large too. You may not be able to cache many of them in shared memory (only ~48 KB per block, depending on the device). So the situation becomes more like matrix-matrix multiplication. For this, you may get some hints from the matrix multiplication code at the following link.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
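To make the size constraint concrete, here is a small sketch of the arithmetic. The helper names rows_per_tile and rows_per_tile_on_device are hypothetical; cudaGetDeviceProperties and the sharedMemPerBlock field are part of the CUDA runtime API. One row of 838 doubles takes 838 * 8 = 6704 bytes, so a 48 KB (49152-byte) budget holds only 7 whole rows.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Number of whole rows of `propertyCount` doubles that fit in `sharedBytes`.
__host__ __device__ inline size_t rows_per_tile(size_t sharedBytes,
                                                size_t propertyCount) {
    return sharedBytes / (propertyCount * sizeof(double));
}

// Same calculation using the per-block limit reported by the device.
// Returns 0 if the device properties cannot be queried.
inline size_t rows_per_tile_on_device(int device, size_t propertyCount) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;
    return rows_per_tile(prop.sharedMemPerBlock, propertyCount);
}

// rows_per_tile(48 * 1024, 838) == 7     (838-D rows, as in the answer)
// rows_per_tile(48 * 1024, 4)   == 1536  (4 properties, as in the question's example)
```

With so few rows per tile, any strategy state cached alongside the data competes for the same budget, which is why the answer points to the tiled matrix multiplication example.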

For people in the future in search of a similar answer, here is what I ended up with for my kernel function:

__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, double investment, double profitability) {
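    // One "row" of data, cached per block. The size must be a compile-time constant.
    __shared__ double sharedData[838];

    // Each thread copies one value, so this requires blockDim.x >= 838
    // (the kernel is launched with 1024 threads per block).
    if (threadIdx.x < 838) {
        sharedData[threadIdx.x] = data[threadIdx.x];
    }

    // Wait until the whole row is loaded before any thread reads it.
    __syncthreads();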

    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x)
    {
        strategies[i].backtest(sharedData, investment, profitability);
    }
}

Note that I use both .cuh and .cu files in my application, and I put this in the .cu file. Also note that I use --device-c in my Makefile when compiling object files. I don't know if that's how things should be done, but that's what worked for me.
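As a possible variation, the row length does not have to be hard-coded. The sketch below uses dynamically sized shared memory (extern __shared__, with the byte count passed as the third launch parameter) and a strided copy loop, so it also works when the block has fewer threads than the row has values. The Strategy struct, the kernel name backtest_row, and the propertyCount parameter are assumptions made for illustration, not the code from the question.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the asker's Strategy class (its real fields are not shown).
struct Strategy {
    double threshold;  // toy parameter
    int hits;          // toy result: properties in the row above threshold

    __device__ void backtest(const double *row, int propertyCount) {
        for (int p = 0; p < propertyCount; ++p) {
            if (row[p] > threshold) hits++;
        }
    }
};

__global__ void backtest_row(const double *row, int propertyCount,
                             Strategy *strategies, int strategyCount) {
    // Size is supplied at launch time: <<<blocks, threads, bytes>>>.
    extern __shared__ double sharedRow[];

    // Strided copy: thread t loads elements t, t + blockDim.x, t + 2 * blockDim.x, ...
    for (int j = threadIdx.x; j < propertyCount; j += blockDim.x) {
        sharedRow[j] = row[j];
    }
    __syncthreads();  // the whole row is loaded before any thread reads it

    // Grid-stride loop over strategies, as in the original kernel.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < strategyCount;
         i += blockDim.x * gridDim.x) {
        strategies[i].backtest(sharedRow, propertyCount);
    }
}

// Host-side check using the first sample row from the question.
bool run_row_demo() {
    const int propertyCount = 4;
    const double hostRow[propertyCount] = {87.6, 85.4, 88.2, 86.1};
    Strategy hostStrategies[2] = {{86.0, 0}, {88.0, 0}};
    double *devRow = nullptr;
    Strategy *devStrategies = nullptr;
    if (cudaMalloc(&devRow, sizeof(hostRow)) != cudaSuccess) return false;
    if (cudaMalloc(&devStrategies, sizeof(hostStrategies)) != cudaSuccess) {
        cudaFree(devRow);
        return false;
    }
    cudaMemcpy(devRow, hostRow, sizeof(hostRow), cudaMemcpyHostToDevice);
    cudaMemcpy(devStrategies, hostStrategies, sizeof(hostStrategies),
               cudaMemcpyHostToDevice);

    // Only 2 threads for 4 values, to exercise the strided copy.
    backtest_row<<<1, 2, propertyCount * sizeof(double)>>>(
        devRow, propertyCount, devStrategies, 2);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;

    cudaMemcpy(hostStrategies, devStrategies, sizeof(hostStrategies),
               cudaMemcpyDeviceToHost);
    cudaFree(devRow);
    cudaFree(devStrategies);
    // Three values exceed 86.0 (87.6, 88.2, 86.1); one exceeds 88.0 (88.2).
    return ok && hostStrategies[0].hits == 3 && hostStrategies[1].hits == 1;
}
```

The trade-off is that the shared memory size must be passed at every launch, and it is still bounded by the per-block limit of the device.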
