How can I use shared memory here in my CUDA kernel?

I have the following CUDA kernel:

__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, double investment, double profitability) {
    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x)
    {
        strategies[i].backtest(data, investment, profitability);
    }
}

TL;DR: I would like to find a way to store data in shared (__shared__) memory. What I don't understand is how to fill the shared variable using multiple threads.

I have seen examples like this one where data is copied to shared memory thread by thread (e.g. myblock[tid] = data[tid]), but I'm not sure how to do this in my situation. The issue is that each thread needs access to an entire "row" (flattened) of data on each iteration through the data set (see further below where the kernel is called).

I'm hoping for something like this:

__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, int propertyCount, double investment, double profitability) {
    __shared__ double sharedData[propertyCount];

    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x)
    {
        strategies[i].backtest(sharedData, investment, profitability);
    }
}

Here are more details (if more information is needed, please ask!):

strategies is a pointer to a list of Strategy objects, and data is a pointer to an allocated, flattened data array.

In backtest() I access data like so:

data[0]
data[1]
data[2]
...

Unflattened, data is a fixed-size 2D array similar to this:

[[87.6, 85.4, 88.2, 86.1],
 [84.1, 86.5, 86.7, 85.9],
 [86.7, 86.5, 86.2, 86.1],
 ...]

As for the kernel call, I iterate over the data items and call it n times for n data rows (about 3.5 million):

int dataCount = 3500000;
int propertyCount = 4;

for (int i = 0; i < dataCount; i++) {
    unsigned int dataPointerOffset = i * propertyCount;

    // Notice pointer arithmetic.
    optimizer_backtest<<<32, 1024>>>(devData + dataPointerOffset, devStrategies, strategyCount, investment, profitability);
}

As confirmed in your comment, you want to apply 20k strategies (this number is from your previous question) to every one of the 3.5M data rows and examine the 20k x 3.5M results.

Without shared memory, you have to read all the data 20k times or all the strategies 3.5M times from global memory.

Shared memory can speed up your program by reducing global memory access. Say you can read 1k strategies and 1k data rows into shared memory each time, examine the 1k x 1k results, and then repeat until everything has been examined. That way you reduce global memory access to 20 passes over all the data and 3.5k passes over all the strategies. This situation is similar to a vector-vector cross product. You can find some reference code for more detail.
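For illustration only, a rough sketch of that idea might look like the kernel below. It assumes the whole data array (or a chunk of it) is passed to a single launch instead of launching once per row, it caches only the data rows in shared memory (not the strategies), TILE_ROWS and PROPERTY_COUNT are made-up compile-time constants, and Strategy::backtest is assumed to be a __device__ function that can take a pointer into shared memory:

#define TILE_ROWS 32
#define PROPERTY_COUNT 4

__global__ void optimizer_backtest_tiled(const double *data, int dataCount,
                                         Strategy *strategies, int strategyCount,
                                         double investment, double profitability) {
    // One tile of data rows cached per block.
    __shared__ double dataTile[TILE_ROWS * PROPERTY_COUNT];

    for (int base = 0; base < dataCount; base += TILE_ROWS) {
        int rowsInTile = min(TILE_ROWS, dataCount - base);

        // Cooperative load: the threads of the block fill the tile together.
        for (int j = threadIdx.x; j < rowsInTile * PROPERTY_COUNT; j += blockDim.x) {
            dataTile[j] = data[(size_t)base * PROPERTY_COUNT + j];  // size_t to avoid 32-bit overflow
        }
        __syncthreads();

        // Each thread tests its strategies against every row in the tile,
        // reading the row from shared memory instead of global memory.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < strategyCount;
             i += blockDim.x * gridDim.x) {
            for (int r = 0; r < rowsInTile; r++) {
                strategies[i].backtest(&dataTile[r * PROPERTY_COUNT],
                                       investment, profitability);
            }
        }
        __syncthreads();  // wait before the next tile overwrites dataTile
    }
}

The key point is the cooperative load: every thread of the block copies part of the tile, and __syncthreads() makes sure the tile is complete before any thread reads it.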

However, each of your data rows is large (an 838-dimensional vector), and maybe the strategies are large too. You may not be able to cache many of them in shared memory (only about 48 KB per block, depending on the device type). So the situation changes to something like matrix-matrix multiplication. For that, you may get some hints from the matrix multiplication code at the following link.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
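If you want to check the exact limit on your device, the CUDA runtime API reports it; a small host program like this (a standard cudaGetDeviceProperties call) prints the per-block shared memory size:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    // sharedMemPerBlock is reported in bytes (about 48 KB on many devices).
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}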

For people searching for a similar answer in the future, here is what I ended up with for my kernel function:

__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, double investment, double profitability) {
    __shared__ double sharedData[838];

    // Each of the first 838 threads copies one element of the current data
    // row into shared memory (the kernel is launched with 1024 threads per
    // block, so the whole row gets loaded).
    if (threadIdx.x < 838) {
        sharedData[threadIdx.x] = data[threadIdx.x];
    }

    // Make sure the row is fully loaded before any thread reads it.
    __syncthreads();

    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x)
    {
        strategies[i].backtest(sharedData, investment, profitability);
    }
}

Note that I use both .cuh and .cu files in my application, and I put this in the .cu file. Also note that I use --device-c in my Makefile when compiling object files. I don't know if that's how things should be done, but that's what worked for me.
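For reference, the separate-compilation part of my build boils down to something like the following (the file names here are placeholders, not my actual project layout):

# Compile each .cu file to an object file with relocatable device code.
nvcc --device-c -o kernel.o kernel.cu
nvcc --device-c -o main.o main.cu

# Let nvcc do the device-link step and produce the final executable.
nvcc -o app kernel.o main.o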
