Decrease cuda kernel runtime: dynamic memory allocation of matrices in kernel

I want to perform an OLS fit for a very large number of small matrices by running the matrix operations in parallel on a GPU. I have written code which seems to be functioning; however, it is slower than anticipated. Currently, it runs faster on a single CPU thread than it does with the parallel computation on the GPU. The Nvidia Visual Profiler seems to indicate that memory allocation is taking up a lot of time. I suspect the culprit is the dynamic memory allocation of different-sized matrices inside the kernel. I need advice and help with speeding up the kernel runtime.

I have tried using new and delete for each matrix created in the loop.

Here is the kernel:

__global__
void comb_ols(double *y, double *X, double *R2, const unsigned int M, const unsigned int N, int *sub_col, int *sub_size, int *cumulative_size, const unsigned int numberOfCalculations){

    int size;
    int start_index;

    int index = blockIdx.x*blockDim.x+threadIdx.x;
    int stride = blockDim.x*gridDim.x;

    // Grid-stride loop: each thread processes a subset of the regressions
    for(int i = index; i < numberOfCalculations; i+=stride){

        size = sub_size[i];
        start_index = cumulative_size[i];

        // Design matrix for this subset: an intercept column plus the selected columns of X
        double *sub_matrix = new double[M*(1+size)];

        for(int j = 0; j < size; j++){
            for(int k = 0; k < M; k++){
                sub_matrix[k] = 1;
                sub_matrix[k + M * (1 + j)] = X[k + M * (sub_col[start_index+j]+1)];
            }
        }

        R2[i] = getR2(y, sub_matrix, M, size+1);

        delete [] sub_matrix;
    }
}

In the device function getR2, we have the following:

__device__
double getR2(double *y, double *X, const unsigned int M, const unsigned int N) {

    // Initialize values
    double R2, numerator;
    double* A = new double[N*N];
    double* IA = new double[N*N];
    double* yX = new double[N];  
    // Generate all components
    XtX(X, A, M, N);
    LUPDecompose(A, N);
    LUPInvert(A, N, IA);
    yTX(y, X, yX, M, N);
    // Calc R2
    numerator = olsR2numerator(yX, IA, N);
    R2 = numerator / yTy(y, M);
    //R2 = yTy(y,M);

    delete[] A;
    delete[] IA;
    delete[] yX;

    return R2;
}

The actual kernel call is like this:

comb_ols<<<numBlocks, blockSize>>>(Y, X, R2, M, N, sub_columns, sub_size, cumulative_size, numberOfCalculations);
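
For context, numBlocks and blockSize are chosen on the host and are not shown above. Below is a minimal sketch of one possible setup (the values are illustrative assumptions, not necessarily those used for the timings that follow); since the kernel allocates with new, the device heap that in-kernel allocations draw from may also need to be enlarged with cudaDeviceSetLimit before the first launch:

const int blockSize = 256;                                                // illustrative choice
const int numBlocks = (numberOfCalculations + blockSize - 1) / blockSize;

// In-kernel new/malloc comes from the device heap (8 MB by default);
// raise the limit so that all concurrent per-thread allocations fit.
cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t(256) * 1024 * 1024);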

Currently, the kernel runtime is roughly 1.4 seconds, whereas on a single-threaded CPU it is 0.7 seconds. I expected the kernel to be much faster, since it only loops over many iterations of matrix operations, which should be well suited to a GPU. There is something inefficient about how the memory for the varying-sized matrices is allocated. What do you guys say about storing various-sized matrices dynamically inside the kernel? How should this be done most efficiently?

Any other feedback on the given code is appreciated.

It looks to me like three very simple rules of thumb are applicable here:

  1. Dynamic memory allocation is always expensive, whatever platform you program on.
  2. Performant code never uses dynamic memory allocation unless it is absolutely necessary.
  3. If dynamic memory allocation is absolutely necessary, pre-allocate memory and re-use it as much as possible.

If you look at your code, it violates all three of these concepts.

You clearly know (or could simply calculate) what the maximum value of sub_size is before the kernel launch. Use that a priori knowledge to your advantage -- pre-allocate heap memory for the calculations which is large enough to process the largest problem in the dataset, and re-use it for the life of the thread (a host-side sketch of computing that maximum appears at the end of this answer). Your kernel could very easily look something like this:

__global__
void comb_ols(double *y, double *X, double *R2, const unsigned int M,
             const unsigned int N, int *sub_col, int *sub_size, int *cumulative_size,
             const unsigned int numberOfCalculations, const int max_size){

    int size;
    int start_index;

    int index = blockIdx.x*blockDim.x+threadIdx.x;
    int stride = blockDim.x*gridDim.x;

    // One allocation per thread, sized for the largest problem and re-used on every iteration
    double *sub_matrix = new double[M*(1+max_size)];
    R2scratch temp(1+max_size);

    for(int i = index; i < numberOfCalculations; i+=stride){

        size = sub_size[i];
        start_index = cumulative_size[i];

        for(int j = 0; j < size; j++){
            for(int k = 0; k < M; k++){
                sub_matrix[k] = 1;
                sub_matrix[k + M * (1 + j)] = X[k + M * (sub_col[start_index+j]+1)];
            }
        }
        R2[i] = getR2(y, sub_matrix, M, size+1, temp);
    }
    delete [] sub_matrix;
}

and the device function something like this:

// Per-thread scratch buffers, allocated once and re-used across calls to getR2
struct R2scratch
{
    double* A;
    double* IA;
    double* yX;

    __device__
    R2scratch(int N) {
        A = new double[N*N];
        IA = new double[N*N];
        yX = new double[N];
    }

    __device__
    ~R2scratch() {
        delete[] A;
        delete[] IA;
        delete[] yX;
    }
};

__device__
double getR2(double *y, double *X, const unsigned int M, const unsigned int N,
             R2scratch &scratch) {

    // Initialize values
    double R2, numerator;
    double* A = scratch.A;
    double* IA = scratch.IA;
    double* yX = scratch.yX;

    // Generate all components
    XtX(X, A, M, N);
    LUPDecompose(A, N);
    LUPInvert(A, N, IA);
    yTX(y, X, yX, M, N);
    // Calc R2
    numerator = olsR2numerator(yX, IA, N);
    R2 = numerator / yTy(y, M);
    //R2 = yTy(y,M);

    return R2;
}

[Code obviously written in browser, never compiled or tested, use at your own risk.]

By doing this you amortize the cost of a one-time memory allocation over many calculations, which should be much more efficient than your current approach.
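
To make the earlier point about sub_size concrete, here is a minimal host-side sketch of computing max_size before the launch and passing it to the modified kernel (the name h_sub_size for a host copy of sub_size is an assumption for illustration):

#include <algorithm>

// h_sub_size: host copy of sub_size, with numberOfCalculations entries
const int max_size = *std::max_element(h_sub_size,
                                       h_sub_size + numberOfCalculations);

comb_ols<<<numBlocks, blockSize>>>(Y, X, R2, M, N, sub_columns, sub_size,
                                   cumulative_size, numberOfCalculations, max_size);

Each thread now performs a single set of allocations (sub_matrix plus the R2scratch buffers) that lives for its entire grid-stride loop, so the per-iteration new/delete traffic disappears.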
