
Why are operations in two CUDA streams not overlapping?

My program is a pipeline that contains multiple kernels and memcpys. Each task goes through the same pipeline with different input data. When the host code processes a task, it first chooses a Channel, an encapsulation of scratchpad memory and CUDA objects. After the last stage, I record an event and then move on to the next task.
The main pipeline logic is shown below. The problem is that operations in different streams do not overlap. I attached the timeline of processing 10 tasks, and you can see that none of the operations in the streams overlap. For each kernel, there are 256 threads in a block and 5 blocks in a grid. All buffers used for memcpy are pinned, so I am sure I have met the requirements for overlapping kernel execution and data transfers. Can someone help me figure out the reason? Thanks.
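For reference, here is a minimal, self-contained sketch (the kernel, sizes, and names are placeholders, not my real code) of the two-stream pattern I expect to overlap on a device with concurrent copy and execute:

```cuda
// Illustrative sketch only: two streams, pinned host buffers,
// async H2D copy -> kernel -> async D2H copy issued per stream.
__global__ void dummy_kernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *h[2], *d[2];
    cudaStream_t s[2];
    for (int c = 0; c < 2; ++c) {
        cudaMallocHost(&h[c], N * sizeof(float));  // pinned host memory
        cudaMalloc(&d[c], N * sizeof(float));
        cudaStreamCreate(&s[c]);
    }
    // Issue the full chain for each stream; copies and kernels in
    // different streams should overlap in the profiler timeline.
    for (int c = 0; c < 2; ++c) {
        cudaMemcpyAsync(d[c], h[c], N * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        dummy_kernel<<<(N + 255) / 256, 256, 0, s[c]>>>(d[c], N);
        cudaMemcpyAsync(h[c], d[c], N * sizeof(float),
                        cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();
    return 0;
}
```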

Environment information

GPU: Tesla K40m (GK110)
Max Warps/SM: 64
Max Thread Blocks/SM: 16
Max Threads/SM: 2048
CUDA version: 8.0

    void execute_task_pipeline(int stage, MyTask *task, Channel *channel) {
        assert(channel->taken);
        assert(!task->finish());

        GPUParam *para = &channel->para;

        assert(para->col_num > 0);
        assert(para->row_num > 0);

        // copy vid_list to device
        CUDA_ASSERT( cudaMemcpyAsync(para->vid_list_d, task->vid_list.data(),
                    sizeof(uint) * para->row_num, cudaMemcpyHostToDevice, channel->stream) );

        k_get_slot_id_list<<<WK_GET_BLOCKS(para->row_num),
            WK_CUDA_NUM_THREADS, 0, channel->stream>>>(
                    vertices_d,
                    para->vid_list_d,
                    para->slot_id_list_d,
                    config.num_buckets,
                    para->row_num);

        k_get_edge_list<<<WK_GET_BLOCKS(para->row_num),
            WK_CUDA_NUM_THREADS, 0, channel->stream>>>(
                    vertices_d,
                    para->slot_id_list_d,
                    para->edge_size_list_d,
                    para->offset_list_d,
                    para->row_num);

        k_calc_prefix_sum(para, channel->stream);

        k_update_result_table_k2u<<<WK_GET_BLOCKS(para->row_num),
            WK_CUDA_NUM_THREADS, 0, channel->stream>>>(
                edges_d,
                para->vid_list_d,
                para->updated_result_table_d,
                para->prefix_sum_list_d,
                para->offset_list_d,
                para->col_num,
                para->row_num);

        para->col_num += 1;
        // copy the number of new rows back to the host
        CUDA_ASSERT( cudaMemcpyAsync(&(channel->num_new_rows), para->prefix_sum_list_d + para->row_num - 1,
                sizeof(uint), cudaMemcpyDeviceToHost, channel->stream) );
        // copy the result to host memory
        CUDA_ASSERT( cudaMemcpyAsync(channel->h_buf, para->updated_result_table_d,
                    channel->num_new_rows * (para->col_num + 1), cudaMemcpyDeviceToHost, channel->stream) );

        // insert a finish event at the end of the pipeline
        CUDA_ASSERT( cudaEventRecord(channel->fin_event, channel->stream) );
    }

[Timeline in the Visual Profiler]

Are you trying to overlap work that only runs for 82 microseconds?

Since you have profiled your application, the clue may be in the big orange box between the two kernel executions (which is not readable in your image).

If it is a synchronization, remove it.

If it is a trace like cudaLaunch_KernelName, try to make your workloads bigger (more data or more computation): you are spending more time issuing a command to the GPU than the GPU takes to execute it, so computations in the different streams cannot run in parallel.
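To check whether launch overhead dominates, here is a rough sketch (the kernel and iteration count are placeholders, illustrative only) that compares the host-side issue time of a launch with the total time including GPU execution:

```cuda
#include <cstdio>
#include <chrono>

__global__ void tiny_kernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int N = 256 * 5;  // 5 blocks of 256 threads, as in the question
    float *d;
    cudaMalloc(&d, N * sizeof(float));
    cudaStream_t s;
    cudaStreamCreate(&s);

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000; ++i)
        tiny_kernel<<<5, 256, 0, s>>>(d, N);   // measures host-side issue cost
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaStreamSynchronize(s);                  // wait for GPU execution to finish
    auto t2 = std::chrono::high_resolution_clock::now();

    double issue_us = std::chrono::duration<double, std::micro>(t1 - t0).count() / 1000;
    double total_us = std::chrono::duration<double, std::micro>(t2 - t0).count() / 1000;
    printf("avg issue %.1f us, avg total %.1f us per launch\n", issue_us, total_us);
    // If issue time is close to total time, launch overhead dominates:
    // the GPU finishes each tiny kernel before the next one is even
    // issued, so streams have nothing to overlap.
    return 0;
}
```

If the two numbers are close, enlarging each task (or batching several tasks per launch) is what gives the streams enough concurrent work to overlap.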
