
CUDA complex object initialization within device: problem with cudaDeviceSetLimit

I am trying to initialize complex objects within my device, within threads and within blocks. It seems to me I have a problem with cudaDeviceSetLimit. Given my understanding of the problem, I am not setting the heap memory amount per thread correctly. This part of the documentation refers to my problem, but they do not initialize an object. I have also read this post but wasn't able to get my code working.
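A quick way to verify what the limit was actually set to is to read it back with cudaDeviceGetLimit; note that the heap size must be set before the first kernel that uses device-side malloc/new is launched. (Sketch added for illustration; the 128 MiB value is an arbitrary example, not from the original post.)

size_t heapSize = 0;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024); // request a 128 MiB device heap
cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);         // read back what is actually in effect
printf("device malloc heap: %zu bytes\n", heapSize);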

EDIT: Contrary to the first answer, doing this inside the kernel is a must in my problem configuration, because I want to take advantage of initializing the objects in parallel across blocks.

I have made the following toy example, which works for a low number of blocks (65) but not for 65535 blocks (the maximum number of blocks I can use on my device):

class NNode{

    public:

        int node_id;
};

class cuNetwork{

    public:

        int num_allnodes;
        NNode** all_nodes; 

};

__global__ void mallocTest(int num_allnodes, cuNetwork** arr_gpu_net){

    int bId = blockIdx.x;

    // each block builds one network on the device heap
    cuNetwork* gpu_net = new cuNetwork();
    gpu_net->all_nodes = new NNode*[num_allnodes];

    for(int i = 0; i < num_allnodes; i++){

            gpu_net->all_nodes[i] = new NNode();
    }

    arr_gpu_net[bId] = gpu_net;

}

int main(int argc, const char **argv){

    int numBlocks = 65; 
    int num_allnodes = 200; 

    cuNetwork** arr_gpu_net; // device array of per-block network pointers
    cudaMalloc((void **)&arr_gpu_net, sizeof(cuNetwork*) * numBlocks);

    size_t size;
    //for each block:
    size  = sizeof(cuNetwork);              // new cuNetwork()
    size += sizeof(NNode*) * num_allnodes;  // new NNode*[num_allnodes]
    size += sizeof(NNode)  * num_allnodes;  // for()... new NNode()

    //size = sizeof(cuNetwork) + (sizeof(int) * 2 + sizeof(NNode)) * num_allnodes;
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, numBlocks * size);
    mallocTest<<<numBlocks, 1>>>(num_allnodes, arr_gpu_net);

    cudaDeviceSynchronize();

    return 0;

}

As soon as I start adding additional properties to the objects, or if I increase numBlocks to 65535, I get the error:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x555555efff90

Thread 1 "no_fun" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (7750,0,0), thread (0,0,0), device 0, sm 1, warp 3, lane 0]
0x0000555555f000b0 in mallocTest(int, cuNetwork**)<<<(65535,1,1),(1,1,1)>>> ()

My question is: in this example, how should I properly set cudaDeviceSetLimit so that each thread has the correct amount of memory for the per-thread initialization of cuNetwork? Any hint would be highly appreciated. Thank you very much for your help.

To answer your question:

Due to memory padding and allocation granularity, each block probably requires more memory than the calculated size. You should always check the return value of new: if it is nullptr, the allocation failed. A minimal sketch of such a check follows.
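For illustration only (the oom_flag parameter and the 2x safety factor are assumptions added for this sketch, not part of the original code):

__global__ void mallocTestChecked(int num_allnodes, cuNetwork** arr_gpu_net, int* oom_flag){

    int bId = blockIdx.x;

    cuNetwork* gpu_net = new cuNetwork();
    if(gpu_net == nullptr){ atomicExch(oom_flag, 1); return; }

    gpu_net->all_nodes = new NNode*[num_allnodes];
    if(gpu_net->all_nodes == nullptr){ atomicExch(oom_flag, 1); return; }

    for(int i = 0; i < num_allnodes; i++){
        gpu_net->all_nodes[i] = new NNode();
        if(gpu_net->all_nodes[i] == nullptr){ atomicExch(oom_flag, 1); return; }
    }

    arr_gpu_net[bId] = gpu_net;
}

//host side: leave generous headroom for allocator overhead,
//then copy oom_flag back with cudaMemcpy to detect failed allocations
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 2 * numBlocks * size);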


However, if the total number of nodes for all networks is known up front, it would be more efficient to just cudaMalloc the memory for all nodes (and all networks) in one go. Then, in the kernel, just update the pointers accordingly.

Something like this:

struct cuNetwork2{
    int num_allnodes;
    NNode* all_nodes;
};

__global__ void kernel(cuNetwork2* d_networks, NNode* d_nodes, int numNodesPerNetwork){
   // one network per block, matching the <<<numBlocks, 1>>> launch from the question
   int index = blockIdx.x;
   d_networks[index].num_allnodes = numNodesPerNetwork;
   d_networks[index].all_nodes = d_nodes + index * numNodesPerNetwork;
}

...

int numBlocks = 65; 
int num_allnodes = 200;

cuNetwork2* d_networks;
NNode* d_nodes;
cudaMalloc(&d_networks, sizeof(cuNetwork2) * numBlocks);
cudaMalloc(&d_nodes, sizeof(NNode) * numBlocks * num_allnodes);

kernel<<<numBlocks, 1>>>(d_networks, d_nodes, num_allnodes);

In this case, you don't need cudaDeviceSetLimit or in-kernel dynamic allocation.在这种情况下,您不需要 cudaDeviceSetLimit 或内核内动态分配。
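Whichever approach you pick, checking the status after the launch would surface a failure directly instead of the later Warp Illegal Address in cuda-gdb. A minimal sketch, assuming <cstdio> is included:

cudaError_t err = cudaGetLastError();                  // launch/configuration errors
if(err == cudaSuccess) err = cudaDeviceSynchronize();  // errors raised during execution
if(err != cudaSuccess){
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
}

cudaFree(d_nodes);     // cleanup
cudaFree(d_networks);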
