CUDA: filling a column-major matrix

Question

I am fairly new to CUDA, and I am trying to offload to the GPU some cumbersome computations I am doing for a performance-critical project. My computer has two NVS 510 graphics cards, but I am currently experimenting with only one.

I have a large column-major matrix (1000-5000 rows x 1-5 M columns) to fill. So far I have been able to write code that fills the matrix as if it were a flat array, and it works well for matrices of relatively small size.

__global__ void interp_kernel(fl_type * d_matrix, fl_type* weights, [other params], 
int n_rows, int num_cols) {
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   int column = index / n_rows;
   int row = index % n_rows;
   if (row > n_rows || column > num_cols) return;
   d_matrix[index] = …something(row, column,[other params]);
}

The kernel is called like this:

fl_type *res;
cudaMalloc((void**)&res, n_columns*n_rows*fl_size);
int block_size = 1024;
int num_blocks = (n_rows* n_columns + block_size - 1) / block_size;
std::cout << "num_blocks:" << num_blocks << std::endl;
interp_kernel << < num_blocks, block_size >> > (res,[other params], n_rows,n_columns);

and everything works just fine. If I change the kernel to work with 2D threads:

__global__ void interp_kernel2D(fl_type * d_matrix, fl_type* weights, [other params], 
int n_rows, int num_cols) {
int column = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int index = column* n_rows + row;
if (row > n_rows || column > num_cols) return;
   d_matrix[index] = …something(row, column,[other params]);
}

and I invoke it like this:

int block_size2 = 32; //each block will have block_size2*block_size2 threads
dim3 num_blocks2(block_size2, block_size2);
int x_grid = (n_columns + block_size2 - 1) / block_size2;
int y_grid = (n_rows + block_size2 - 1) / block_size2;
dim3 grid_size2(x_grid, y_grid);
interp_kernel2D <<< grid_size2, num_blocks2 >>> (res,[other params], n_rows,n_columns);

the results are all zero and CUDA returns an unknown error. What am I missing? The actual code, which compiles without error with VS2015 and CUDA 8.0, can be found here: https://pastebin.com/XBCVC7VV

Here is the code from the pastebin link:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <assert.h>
#include <iostream>
#include <random>
#include <chrono>
typedef float fl_type;
typedef int pos_type;
typedef std::chrono::milliseconds ms;
//declaration of the cuda function
void cuda_interpolation_function(fl_type* interp_value_back, int result_size, fl_type * grid_values, int grid_values_size, fl_type* weights, pos_type* node_map, int  total_action_number, int  interp_dim, int n_sim);

fl_type iterp_cpu(fl_type* weights, pos_type* node_map, fl_type* grid_values, int& row, int& column, int& interp_dim, int& n_sim) {
    int w_p = column*interp_dim;
    fl_type res = weights[w_p] * grid_values[row + node_map[w_p] * n_sim];
    for (int inter_point = 1; inter_point < interp_dim; inter_point++) {
        res += weights[w_p + inter_point] * grid_values[node_map[w_p + inter_point] * n_sim + row];
    }
    return res;
}


__global__ void interp_kernel(fl_type * d_matrix, fl_type* weights, pos_type* node_map, fl_type* grid_values, int interp_dim, int n_sim, int num_cols) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int column = index / n_sim;
    int row = index % n_sim;
    int w_p = column*interp_dim;
    if (row > n_sim || column > num_cols) return;
    fl_type res = weights[w_p] * grid_values[row + node_map[w_p] * n_sim];
    for (int inter_point = 1; inter_point < interp_dim; inter_point++) {
        res += weights[w_p + inter_point] * grid_values[row + node_map[w_p + inter_point] * n_sim];
    }
    d_matrix[index] = res;
}

__global__ void interp_kernel2D(fl_type * d_matrix, fl_type* weights, pos_type* node_map, fl_type* grid_values, int interp_dim, int n_sim, int num_cols) {
    int column = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int index = column*n_sim + row;
    int w_p = column*interp_dim;
    if (row > n_sim || column > num_cols) return;
    fl_type res = weights[w_p] * grid_values[row + node_map[w_p] * n_sim];
    for (int inter_point = 1; inter_point < interp_dim; inter_point++) {
        res += weights[w_p + inter_point] * grid_values[row + node_map[w_p + inter_point] * n_sim];
    }
    d_matrix[index] = res;
}

void verify(fl_type *host, fl_type *device, int size) {
    int count = 0;
    int count_zero = 0;
    for (int i = 0; i < size; i++) {
        if (host[i] != device[i]) {
            count++;
            //std::cout <<"pos: " <<i<< " CPU:" <<h[i] << ",        GPU: " << d[i] <<std::endl;
            assert(host[i] == device[i]);
            if (device[i] == 0.0)
                count_zero++;
        }
    }
    if (count) {
        std::cout << "Non matching: " << count << "out of " << size << "(" << (float(count) / size * 100) << "%)" << std::endl;
        std::cout << "Zeros returned from the device: " << count_zero <<"(" << (float(count_zero) / size * 100) << "%)" << std::endl;
    }
    else
        std::cout << "Perfect match!" << std::endl;
}

int main() {
    int fl_size = sizeof(fl_type);
    int pos_size = sizeof(pos_type);
    int dim = 5;             // range: 2-5
    int number_nodes = 5500; // range: 10.000-500.000
    int max_actions = 12;    // range: 6-200
    int n_sim = 1000;        // range: 1.000-10.000
    int interp_dim = std::pow(2, dim);
    int grid_values_size = n_sim*number_nodes;
    std::default_random_engine generator;
    std::normal_distribution<fl_type> normal_dist(0.0, 1);
    std::uniform_int_distribution<> uniform_dist(0, number_nodes - 1);

    double bit_allocated = 0;
    fl_type * grid_values;  //flattened 2d array, containing the value of the grid (n_sims x number_nodes)
    grid_values = (fl_type *)malloc(grid_values_size * fl_size);
    bit_allocated += grid_values_size * fl_size;
    for (int i = 0; i < grid_values_size; i++)
        grid_values[i] = normal_dist(generator);

    pos_type * map_node2values_start; //vector that maps each node to the first column of the result matrix regarding that done
    pos_type * map_node2values_how_many; //vector that stores how many action we have per node  
    map_node2values_start = (pos_type *)malloc(number_nodes * pos_size);
    map_node2values_how_many = (pos_type *)malloc(number_nodes * pos_size);


    bit_allocated += 2 * (number_nodes * pos_size);
    for (int i = 0; i < number_nodes; i++) {
        //each node as simply max_actions
        map_node2values_start[i] = max_actions*i;
        map_node2values_how_many[i] = max_actions;
    }

    //total number of actions, which is amount of column of the results
    int total_action_number = map_node2values_start[number_nodes - 1] + map_node2values_how_many[number_nodes - 1];

    //vector that keep tracks of the columnt to grab, and their weight in the interpolation
    fl_type* weights;
    pos_type * node_map;
    weights = (fl_type *)malloc(total_action_number*interp_dim * pos_size);
    bit_allocated += total_action_number * fl_size;
    node_map = (pos_type *)malloc(total_action_number*interp_dim * pos_size);
    bit_allocated += total_action_number * pos_size;

    //filling with random numbers
    for (int i = 0; i < total_action_number*interp_dim; i++) {
        node_map[i] = uniform_dist(generator);      // picking random column
        weights[i] = 1.0 / interp_dim;              // uniform weights
    }
    std::cout << "done filling!" << std::endl;
    std::cout << bit_allocated / 8 / 1024 / 1024 << "MB allocated" << std::endl;

    int result_size = n_sim*total_action_number;
    fl_type *interp_value_cpu;
    bit_allocated += result_size* fl_size;



    interp_value_cpu = (fl_type *)malloc(result_size* fl_size);

    auto start = std::chrono::steady_clock::now();
    for (int row = 0; row < n_sim; row++) {
        for (int column = 0; column < total_action_number; column++) {
            auto zz = iterp_cpu(weights, node_map, grid_values, row, column, interp_dim, n_sim);
            interp_value_cpu[column*n_sim + row] = zz;
        }
    }
    auto elapsed_cpu = std::chrono::steady_clock::now() - start;
    std::cout << "Crunching values on the CPU (serial): " << std::chrono::duration_cast<ms>(elapsed_cpu).count() / 1000.0 << "s" << std::endl;
    int * pp;
    cudaMalloc((void**)&pp, sizeof(int)); //initializing the device, to not affect the benchmark
    fl_type *interp_value_gpu;
    interp_value_gpu = (fl_type *)malloc(result_size* fl_size);
    start = std::chrono::steady_clock::now();
    cuda_interpolation_function(interp_value_gpu, result_size, grid_values, grid_values_size, weights, node_map, total_action_number, interp_dim, n_sim);
    auto elapsed_gpu = std::chrono::steady_clock::now() - start;
    std::cout << "Crunching values on the GPU: " << std::chrono::duration_cast<ms>(elapsed_gpu).count() / 1000.0 << "s" << std::endl;
    float ms_cpu = std::chrono::duration_cast<ms>(elapsed_cpu).count();
    float ms_gpu = std::chrono::duration_cast<ms>(elapsed_gpu).count();
    int n_proc = 4;
    std::cout << "Performance: " << (ms_gpu- ms_cpu / n_proc) / (ms_cpu / n_proc) * 100 << " % less time than parallel CPU!" << std::endl;
    verify(interp_value_cpu, interp_value_gpu, result_size);

    free(interp_value_cpu);
    free(interp_value_gpu);
    free(grid_values);
    free(node_map);
    free(weights);
}

void cuda_interpolation_function(fl_type* interp_value_gpu, int result_size, fl_type * grid_values, int grid_values_size, fl_type* weights, pos_type* node_map, int total_action_number, int interp_dim, int n_sim) {
    int fl_size = sizeof(fl_type);
    int pos_size = sizeof(pos_type);
    auto start = std::chrono::steady_clock::now();
    //device versions of the inputs
    fl_type * grid_values_device;
    fl_type* weights_device;
    pos_type * node_map_device;
    fl_type *interp_value_device;
    int lenght_node_map = interp_dim*total_action_number;
    std::cout << "size grid_values: " << grid_values_size <<std::endl;
    std::cout << "size weights: " << lenght_node_map << std::endl;
    std::cout << "size interp_value: " << result_size << std::endl;

    //allocating and moving to the GPU the inputs
    auto error_code=cudaMalloc((void**)&grid_values_device, grid_values_size*fl_size);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMalloc of the grid_values" << std::endl;
    }
    error_code=cudaMemcpy(grid_values_device, grid_values, grid_values_size*fl_size, cudaMemcpyHostToDevice);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMemcpy of the grid_values" << std::endl;
    }
    error_code=cudaMalloc((void**)&weights_device, lenght_node_map*fl_size);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMalloc of the weights" << std::endl;
    }
    error_code=cudaMemcpy(weights_device, weights, lenght_node_map*fl_size, cudaMemcpyHostToDevice);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMemcpy of the weights" << std::endl;
    }
    error_code=cudaMalloc((void**)&node_map_device, lenght_node_map*pos_size);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMalloc of node_map" << std::endl;
    }
    error_code=cudaMemcpy(node_map_device, node_map, lenght_node_map*pos_size, cudaMemcpyHostToDevice);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMemcpy of node_map" << std::endl;
    }
    error_code=cudaMalloc((void**)&interp_value_device, result_size*fl_size);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMalloc of interp_value_device " << std::endl;
    }
    auto elapsed_moving = std::chrono::steady_clock::now() - start;
    float ms_moving = std::chrono::duration_cast<ms>(elapsed_moving).count();
    cudaDeviceSynchronize();
    //1d
    int block_size = 1024;
    int num_blocks = (result_size + block_size - 1) / block_size;
    std::cout << "num_blocks:" << num_blocks << std::endl;
    interp_kernel << < num_blocks, block_size >> > (interp_value_device, weights_device, node_map_device, grid_values_device, interp_dim, n_sim, total_action_number);


    //2d
    //int block_size2 = 32; //each block will have block_size2*block_size2 threads
    //dim3 num_blocks2(block_size2, block_size2);
    //int x_grid = (total_action_number + block_size2 - 1) / block_size2;
    //int y_grid = (n_sim + block_size2 - 1) / block_size2;
    //dim3 grid_size2(x_grid, y_grid);
    //std::cout <<"grid:"<< x_grid<<" x "<< y_grid<<std::endl;
    //interp_kernel2D <<< grid_size2, num_blocks2 >>> (interp_value_device, weights_device, node_map_device, grid_values_device, interp_dim, n_sim, total_action_number);


    cudaDeviceSynchronize();
    cudaError err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        std::cout << "Cuda kernel failed! " << cudaGetErrorString(err) <<std::endl;
    }
    start = std::chrono::steady_clock::now();
    cudaMemcpy(interp_value_gpu, interp_value_device, result_size*fl_size, cudaMemcpyDeviceToHost);
    auto elapsed_moving_back = std::chrono::steady_clock::now() - start;
    float ms_moving_back = std::chrono::duration_cast<ms>(elapsed_moving_back).count();

    std::cout << "Time spent moving the data to the GPU:" << ms_moving << " ms"<<std::endl;
    std::cout << "Time spent moving the results back to the host: " << ms_moving_back << " ms" << std::endl;

    cudaFree(interp_value_device);
    cudaFree(weights_device);
    cudaFree(node_map_device);
    cudaFree(grid_values_device);
}

Moreover, I would be extremely grateful for any direction on how to improve the performance of the code.

Answer 1

Any time you are having trouble with a CUDA code, I recommend doing proper CUDA error checking (which you mostly seem to be doing), and also running your code with cuda-memcheck. This utility is similar to "enabling the memory checker" in Nsight VSE, but not quite the same. However, the Nsight VSE memory checker may have given you the same indication.

In C (or C++), array indexing generally starts at 0. Therefore, to test for an out-of-bounds index, I must check whether the generated index is equal to or greater than the size of the array. But in your case you are only testing for greater than:

if (row > n_sim || column > num_cols) return;

You make the same error in both your 1D kernel and your 2D kernel, and although you believe your 1D kernel is working correctly, it is actually making out-of-bounds accesses. You can verify this by running with the aforementioned cuda-memcheck utility (or probably also with the memory checker that can be enabled in Nsight VSE).

When I modify the code from your pastebin link to use proper range/bounds checking, cuda-memcheck reports no errors, and your program reports the correct results. I've tested both cases, but the code below is modified from your pastebin link to uncomment the 2D case and use that instead of the 1D case:

$ cat t375.cu | more
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <assert.h>
#include <iostream>
#include <random>
#include <chrono>
typedef float fl_type;
typedef int pos_type;
typedef std::chrono::milliseconds ms;
//declaration of the cuda function
void cuda_interpolation_function(fl_type* interp_value_back, int result_size, fl_type * grid_values, int grid_values_size, fl_type* weights, pos_type* node_map, int  total_action_number, int  interp_dim, int n_sim);

fl_type iterp_cpu(fl_type* weights, pos_type* node_map, fl_type* grid_values, int& row, int& column, int& interp_dim, int& n_sim) {
    int w_p = column*interp_dim;
    fl_type res = weights[w_p] * grid_values[row + node_map[w_p] * n_sim];
    for (int inter_point = 1; inter_point < interp_dim; inter_point++) {
        res += weights[w_p + inter_point] * grid_values[node_map[w_p + inter_point] * n_sim + row];
    }
    return res;
}


__global__ void interp_kernel(fl_type * d_matrix, fl_type* weights, pos_type* node_map, fl_type* grid_values, int interp_dim, int n_sim, int num_cols) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int column = index / n_sim;
    int row = index % n_sim;
    int w_p = column*interp_dim;
    if (row >= n_sim || column >= num_cols) return;  // modified
    fl_type res = weights[w_p] * grid_values[row + node_map[w_p] * n_sim];
    for (int inter_point = 1; inter_point < interp_dim; inter_point++) {
        res += weights[w_p + inter_point] * grid_values[row + node_map[w_p + inter_point] * n_sim];
    }
    d_matrix[index] = res;
}

__global__ void interp_kernel2D(fl_type * d_matrix, fl_type* weights, pos_type* node_map, fl_type* grid_values, int interp_dim, int n_sim, int num_cols) {
    int column = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int index = column*n_sim + row;
    int w_p = column*interp_dim;
    if (row >= n_sim || column >= num_cols) return;  // modified
    fl_type res = weights[w_p] * grid_values[row + node_map[w_p] * n_sim];
    for (int inter_point = 1; inter_point < interp_dim; inter_point++) {
        res += weights[w_p + inter_point] * grid_values[row + node_map[w_p + inter_point] * n_sim];
    }
    d_matrix[index] = res;
}

void verify(fl_type *host, fl_type *device, int size) {
    int count = 0;
    int count_zero = 0;
    for (int i = 0; i < size; i++) {
        if (host[i] != device[i]) {
            count++;
            //std::cout <<"pos: " <<i<< " CPU:" <<h[i] << ",        GPU: " << d[i] <<std::endl;
            assert(host[i] == device[i]);
            if (device[i] == 0.0)
                count_zero++;
        }
    }
    if (count) {
        std::cout << "Non matching: " << count << "out of " << size << "(" << (float(count) / size * 100) << "%)" << std::endl;
        std::cout << "Zeros returned from the device: " << count_zero <<"(" << (float(count_zero) / size * 100) << "%)" << std::endl;
    }
    else
        std::cout << "Perfect match!" << std::endl;
}

int main() {
    int fl_size = sizeof(fl_type);
    int pos_size = sizeof(pos_type);
    int dim = 5;             // range: 2-5
    int number_nodes = 5500; // range: 10.000-500.000
    int max_actions = 12;    // range: 6-200
    int n_sim = 1000;        // range: 1.000-10.000
    int interp_dim = std::pow(2, dim);
    int grid_values_size = n_sim*number_nodes;
    std::default_random_engine generator;
    std::normal_distribution<fl_type> normal_dist(0.0, 1);
    std::uniform_int_distribution<> uniform_dist(0, number_nodes - 1);

    double bit_allocated = 0;
    fl_type * grid_values;  //flattened 2d array, containing the value of the grid (n_sims x number_nodes)
    grid_values = (fl_type *)malloc(grid_values_size * fl_size);
    bit_allocated += grid_values_size * fl_size;
    for (int i = 0; i < grid_values_size; i++)
        grid_values[i] = normal_dist(generator);

    pos_type * map_node2values_start; //vector that maps each node to the first column of the result matrix regarding that done
    pos_type * map_node2values_how_many; //vector that stores how many action we have per node
    map_node2values_start = (pos_type *)malloc(number_nodes * pos_size);
    map_node2values_how_many = (pos_type *)malloc(number_nodes * pos_size);


    bit_allocated += 2 * (number_nodes * pos_size);
    for (int i = 0; i < number_nodes; i++) {
        //each node as simply max_actions
        map_node2values_start[i] = max_actions*i;
        map_node2values_how_many[i] = max_actions;
    }

    //total number of actions, which is amount of column of the results
    int total_action_number = map_node2values_start[number_nodes - 1] + map_node2values_how_many[number_nodes - 1];

    //vector that keep tracks of the columnt to grab, and their weight in the interpolation
    fl_type* weights;
    pos_type * node_map;
    weights = (fl_type *)malloc(total_action_number*interp_dim * pos_size);
    bit_allocated += total_action_number * fl_size;
    node_map = (pos_type *)malloc(total_action_number*interp_dim * pos_size);
    bit_allocated += total_action_number * pos_size;

    //filling with random numbers
    for (int i = 0; i < total_action_number*interp_dim; i++) {
        node_map[i] = uniform_dist(generator);      // picking random column
        weights[i] = 1.0 / interp_dim;              // uniform weights
    }
    std::cout << "done filling!" << std::endl;
    std::cout << bit_allocated / 8 / 1024 / 1024 << "MB allocated" << std::endl;

    int result_size = n_sim*total_action_number;
    fl_type *interp_value_cpu;
    bit_allocated += result_size* fl_size;



    interp_value_cpu = (fl_type *)malloc(result_size* fl_size);

    auto start = std::chrono::steady_clock::now();
    for (int row = 0; row < n_sim; row++) {
        for (int column = 0; column < total_action_number; column++) {
            auto zz = iterp_cpu(weights, node_map, grid_values, row, column, interp_dim, n_sim);
            interp_value_cpu[column*n_sim + row] = zz;
        }
    }
    auto elapsed_cpu = std::chrono::steady_clock::now() - start;
    std::cout << "Crunching values on the CPU (serial): " << std::chrono::duration_cast<ms>(elapsed_cpu).count() / 1000.0 << "s" << std::endl;
    int * pp;
    cudaMalloc((void**)&pp, sizeof(int)); //initializing the device, to not affect the benchmark
    fl_type *interp_value_gpu;
    interp_value_gpu = (fl_type *)malloc(result_size* fl_size);
    start = std::chrono::steady_clock::now();
    cuda_interpolation_function(interp_value_gpu, result_size, grid_values, grid_values_size, weights, node_map, total_action_number, interp_dim, n_sim);
    auto elapsed_gpu = std::chrono::steady_clock::now() - start;
    std::cout << "Crunching values on the GPU: " << std::chrono::duration_cast<ms>(elapsed_gpu).count() / 1000.0 << "s" << std::endl;
    float ms_cpu = std::chrono::duration_cast<ms>(elapsed_cpu).count();
    float ms_gpu = std::chrono::duration_cast<ms>(elapsed_gpu).count();
    int n_proc = 4;
    std::cout << "Performance: " << (ms_gpu- ms_cpu / n_proc) / (ms_cpu / n_proc) * 100 << " % less time than parallel CPU!" << std::endl;
    verify(interp_value_cpu, interp_value_gpu, result_size);

    free(interp_value_cpu);
    free(interp_value_gpu);
    free(grid_values);
    free(node_map);
    free(weights);
}

void cuda_interpolation_function(fl_type* interp_value_gpu, int result_size, fl_type * grid_values, int grid_values_size, fl_type* weights, pos_type* node_map, int total_action_number, int interp_dim, int n_sim) {
    int fl_size = sizeof(fl_type);
    int pos_size = sizeof(pos_type);
    auto start = std::chrono::steady_clock::now();
    //device versions of the inputs
    fl_type * grid_values_device;
    fl_type* weights_device;
    pos_type * node_map_device;
    fl_type *interp_value_device;
    int lenght_node_map = interp_dim*total_action_number;
    std::cout << "size grid_values: " << grid_values_size <<std::endl;
    std::cout << "size weights: " << lenght_node_map << std::endl;
    std::cout << "size interp_value: " << result_size << std::endl;

    //allocating and moving to the GPU the inputs
    auto error_code=cudaMalloc((void**)&grid_values_device, grid_values_size*fl_size);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMalloc of the grid_values" << std::endl;
    }
    error_code=cudaMemcpy(grid_values_device, grid_values, grid_values_size*fl_size, cudaMemcpyHostToDevice);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMemcpy of the grid_values" << std::endl;
    }
    error_code=cudaMalloc((void**)&weights_device, lenght_node_map*fl_size);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMalloc of the weights" << std::endl;
    }
    error_code=cudaMemcpy(weights_device, weights, lenght_node_map*fl_size, cudaMemcpyHostToDevice);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMemcpy of the weights" << std::endl;
    }
    error_code=cudaMalloc((void**)&node_map_device, lenght_node_map*pos_size);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMalloc of node_map" << std::endl;
    }
    error_code=cudaMemcpy(node_map_device, node_map, lenght_node_map*pos_size, cudaMemcpyHostToDevice);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMemcpy of node_map" << std::endl;
    }
    error_code=cudaMalloc((void**)&interp_value_device, result_size*fl_size);
    if (error_code != cudaSuccess) {
        std::cout << "Error during cudaMalloc of interp_value_device " << std::endl;
    }
    auto elapsed_moving = std::chrono::steady_clock::now() - start;
    float ms_moving = std::chrono::duration_cast<ms>(elapsed_moving).count();
    cudaDeviceSynchronize();
    //1d
#if 0
    int block_size = 1024;
    int num_blocks = (result_size + block_size - 1) / block_size;
    std::cout << "num_blocks:" << num_blocks << std::endl;
    interp_kernel << < num_blocks, block_size >> > (interp_value_device, weights_device, node_map_device, grid_values_device, interp_dim, n_sim, total_action_number);
#endif

    //2d
    int block_size2 = 32; //each block will have block_size2*block_size2 threads
    dim3 num_blocks2(block_size2, block_size2);
    int x_grid = (total_action_number + block_size2 - 1) / block_size2;
    int y_grid = (n_sim + block_size2 - 1) / block_size2;
    dim3 grid_size2(x_grid, y_grid);
    std::cout <<"grid:"<< x_grid<<" x "<< y_grid<<std::endl;
    interp_kernel2D <<< grid_size2, num_blocks2 >>> (interp_value_device, weights_device, node_map_device, grid_values_device, interp_dim, n_sim, total_action_number);


    cudaDeviceSynchronize();
    cudaError err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        std::cout << "Cuda kernel failed! " << cudaGetErrorString(err) <<std::endl;
    }
    start = std::chrono::steady_clock::now();
    cudaMemcpy(interp_value_gpu, interp_value_device, result_size*fl_size, cudaMemcpyDeviceToHost);
    auto elapsed_moving_back = std::chrono::steady_clock::now() - start;
    float ms_moving_back = std::chrono::duration_cast<ms>(elapsed_moving_back).count();

    std::cout << "Time spent moving the data to the GPU:" << ms_moving << " ms"<<std::endl;
    std::cout << "Time spent moving the results back to the host: " << ms_moving_back << " ms" << std::endl;

    cudaFree(interp_value_device);
    cudaFree(weights_device);
    cudaFree(node_map_device);
    cudaFree(grid_values_device);
}
$ nvcc -arch=sm_52 -o t375 t375.cu -std=c++11
$ cuda-memcheck ./t375
========= CUDA-MEMCHECK
done filling!
2.69079MB allocated
Crunching values on the CPU (serial): 30.081s
size grid_values: 5500000
size weights: 2112000
size interp_value: 66000000
grid:2063 x 32
Time spent moving the data to the GPU:31 ms
Time spent moving the results back to the host: 335 ms
Crunching values on the GPU: 7.089s
Performance: -5.73452 % less time than parallel CPU!
Perfect match!
========= ERROR SUMMARY: 0 errors
$

Note that cuda-memcheck slows down the execution of your program on the GPU in order to do rigorous memory bounds checking. Therefore the performance may not match the ordinary case. This is what an "ordinary" run looks like:

$ ./t375
done filling!
2.69079MB allocated
Crunching values on the CPU (serial): 30.273s
size grid_values: 5500000
size weights: 2112000
size interp_value: 66000000
grid:2063 x 32
Time spent moving the data to the GPU:32 ms
Time spent moving the results back to the host: 332 ms
Crunching values on the GPU: 1.161s
Performance: -84.6596 % less time than parallel CPU!
Perfect match!
$

Answer 2

You are accessing memory beyond the allocated chunk. To check whether the row and column indices are within range:

if (row >= n_rows || column >= num_cols) return;      // Do this
if (row >  n_rows || column >  num_cols) return;      // Instead of this

In the flat version, int row = index % n_rows; keeps row below n_rows. You only access one column beyond the allocated memory, which for a small matrix could still be within the memory alignment. Python demo.
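The claim about the flat version can be checked without a GPU. The following is a minimal host-side C++ sketch, not part of the original code: the function name count_escapes_1d and its parameters are hypothetical, and it only replays the index arithmetic of the 1D kernel for the padding threads of the last block.

```cpp
#include <cassert>

// Hypothetical helper: replays the 1D kernel's index arithmetic on the host.
// Returns how many launched threads have an index past the end of the matrix
// and are still let through by the guard.
long long count_escapes_1d(int n_rows, int n_cols, int block_size, bool use_ge) {
    assert(n_rows > 0 && n_cols > 0 && block_size > 0);
    const long long total = 1LL * n_rows * n_cols;
    // Ceil division, as in the launch code of the question.
    const long long num_blocks = (total + block_size - 1) / block_size;
    long long escapes = 0;
    // Only the padding threads of the last block have index >= total.
    for (long long index = total; index < num_blocks * block_size; ++index) {
        const long long column = index / n_rows;
        const long long row = index % n_rows;  // always below n_rows
        const bool rejected = use_ge ? (row >= n_rows || column >= n_cols)
                                     : (row > n_rows || column > n_cols);
        if (!rejected) ++escapes;
    }
    return escapes;
}
```

With the pastebin's sizes (n_rows = 1000, n_cols = 66000, block_size = 1024), the `>` guard lets 896 padding threads through, all of them in column 66000, while the `>=` guard lets none through.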

The second version does access an extra column plus an extra element, and one extra element for each column (the first element of the following column, since the matrix is column-major), because this line:

int row = blockIdx.y * blockDim.y + threadIdx.y;

no longer keeps the row index within the valid range. Python demo.
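The 2D case can be replayed on the host in the same spirit. This is only a sketch under stated assumptions: count_escapes_2d and block_edge are hypothetical names, the block is assumed square (block_edge x block_edge threads, as in the question), and no memory is touched, only the guard is evaluated for every launched thread.

```cpp
#include <cassert>

// Hypothetical helper: replays the 2D kernel's index arithmetic on the host.
// Returns how many launched threads fall outside the n_rows x n_cols matrix
// and are still let through by the guard.
long long count_escapes_2d(int n_rows, int n_cols, int block_edge, bool use_ge) {
    assert(n_rows > 0 && n_cols > 0 && block_edge > 0);
    // Threads launched per dimension: grid size (ceil division) times block edge.
    const int x_threads = ((n_cols + block_edge - 1) / block_edge) * block_edge;
    const int y_threads = ((n_rows + block_edge - 1) / block_edge) * block_edge;
    long long escapes = 0;
    for (int column = 0; column < x_threads; ++column) {
        for (int row = 0; row < y_threads; ++row) {
            const bool in_matrix = row < n_rows && column < n_cols;
            const bool rejected = use_ge ? (row >= n_rows || column >= n_cols)
                                         : (row > n_rows || column > n_cols);
            if (!in_matrix && !rejected) ++escapes;
        }
    }
    return escapes;
}
```

With the pastebin's sizes (n_rows = 1000, n_cols = 66000, block_edge = 32), the `>` guard lets 67001 stray threads through: one with row equal to 1000 for each of the 66000 real columns, plus 1001 in column 66000. The `>=` guard lets none through.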

Looking at your pastebin, this is probably the place where it breaks:

    fl_type res = weights[w_p] * grid_values[row + node_map[w_p] * n_sim];
                                             ^^^
    for (int inter_point = 1; inter_point < interp_dim; inter_point++) {
        res += weights[w_p + inter_point] * grid_values[row + node_map[w_p + inter_point] * n_sim];
                                                        ^^^
    }

2 answers

Answer 1: 1, accepted, 2017-07-30 14:55:41

Answer 2: 1
