Why is there no activity on the GPU between consecutive thrust sort and reduce calls?

Question

Please refer to the two snapshots below, which show an Nvidia Visual Profiler session for my CUDA code:

[Image: snapshot of the nvprof session showing the execution timeline of the thrust::sort and thrust::reduce calls]

[Image: the sort and reduce calls highlighted to show the time taken and the gaps between their executions]

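You can see a gap of roughly 70 us between the two thrust::sort() calls, and then a large gap between the first thrust::reduce() call and the second thrust::sort() call. In total, about 300 us of such gaps are visible in the snapshot. I believe these are "idle" times, perhaps introduced by the thrust library. In any case, I could not find any related discussion or documentation from Nvidia. Can someone explain why I get such evident "idle" times? Added up, they account for 40% of my application's execution time, so this is a big concern for me!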

Also, I have measured that the gap between consecutive calls to CUDA kernels that I wrote myself is only about 3 us!

I have written a sample CUDA code so that I can post it here:

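#include <cstdlib>                    // rand, RAND_MAX
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>        // cudaProfilerStart, cudaProfilerStop
#include <thrust/execution_policy.h>  // thrust::device
#include <thrust/reduce.h>
#include <thrust/sort.h>

void profileThrustSortAndReduce(const int ARR_SIZE) {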
    // for thrust::reduce on first 10% of the sorted array
    const int ARR_SIZE_BY_10 = ARR_SIZE / 10;

    // generate host random arrays of float
    float* h_arr1;          cudaMallocHost((void **)&h_arr1, ARR_SIZE * sizeof(float));
    float* h_arr2;          cudaMallocHost((void **)&h_arr2, ARR_SIZE * sizeof(float));
    for (int i = 0; i < ARR_SIZE; i++) {
        h_arr1[i] = static_cast <float> (rand()) / static_cast <float> (RAND_MAX)* 1000.0f;
        h_arr2[i] = static_cast <float> (rand()) / static_cast <float> (RAND_MAX)* 1000.0f;
    }

    // device arrays populated
    float* d_arr1;          cudaMalloc((void **)&d_arr1, ARR_SIZE * sizeof(float));
    float* d_arr2;          cudaMalloc((void **)&d_arr2, ARR_SIZE * sizeof(float));
    cudaMemcpy(d_arr1, h_arr1, ARR_SIZE * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_arr2, h_arr2, ARR_SIZE * sizeof(float), cudaMemcpyHostToDevice);

    // start cuda profiler
    cudaProfilerStart();

    // sort the two device arrays
    thrust::sort(thrust::device, d_arr1, d_arr1 + ARR_SIZE);
    thrust::sort(thrust::device, d_arr2, d_arr2 + ARR_SIZE);

    // mean of 100 percentiles of device array
    float arr1_red_100pc_mean = thrust::reduce(thrust::device, d_arr1, d_arr1 + ARR_SIZE) / ARR_SIZE;
    // mean of smallest 10 percentiles of device array
    float arr1_red_10pc_mean = thrust::reduce(thrust::device, d_arr1, d_arr1 + ARR_SIZE_BY_10) / ARR_SIZE_BY_10;

    // mean of 100 percentiles of device array
    float arr2_red_100pc_mean = thrust::reduce(thrust::device, d_arr2, d_arr2 + ARR_SIZE) / ARR_SIZE;
    // mean of smallest 10 percentiles of device array
    float arr2_red_10pc_mean = thrust::reduce(thrust::device, d_arr2, d_arr2 + ARR_SIZE_BY_10) / ARR_SIZE_BY_10;

    // stop cuda profiler
    cudaProfilerStop();
}

[Image: snapshot of the nvprof session for this sample function]

Answer 1

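The gaps are mostly caused by cudaMalloc operations. thrust::sort, and presumably thrust::reduce as well, allocate (and free) temporary storage associated with their activity.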

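You have cropped this part of the timeline out of the first two pictures you pasted into the question, but immediately above the portion of the timeline shown in your third picture you will find the cudaMalloc operations in the "Runtime API" row of the profiler.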

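These cudaMalloc (and cudaFree) operations are time-consuming and synchronizing. To work around this, the usual advice is to use a thrust custom allocator. That way you can allocate the necessary size once, at the beginning of your program, and avoid the allocate/free overhead on every thrust call.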
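As a rough illustration of the custom-allocator approach, here is a minimal sketch modeled on the caching allocator in Thrust's own custom temporary allocation example. The names CachedAllocator, num_cuda_mallocs and sortAndMean are hypothetical and not part of Thrust. It assumes that passing the allocator through thrust::cuda::par(alloc) routes the temporary storage requests of thrust::sort and thrust::reduce to it. Unlike the Thrust example, which reuses only blocks of exactly the requested size, this sketch hands out any idle block that is large enough.

```cuda
#include <cstddef>
#include <map>
#include <new>
#include <utility>
#include <cuda_runtime.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

// Hypothetical caching allocator (not part of Thrust): blocks released by
// Thrust are kept and handed out again instead of going through
// cudaFree/cudaMalloc on every algorithm call.
struct CachedAllocator {
    typedef char value_type;

    std::multimap<std::ptrdiff_t, char*> free_blocks;  // size -> idle block
    std::map<char*, std::ptrdiff_t> used_blocks;       // block -> its size
    int num_cuda_mallocs = 0;                          // for inspection only

    char* allocate(std::ptrdiff_t num_bytes) {
        char* ptr = nullptr;
        // Reuse the smallest idle block that is large enough, if there is one.
        auto it = free_blocks.lower_bound(num_bytes);
        if (it != free_blocks.end()) {
            num_bytes = it->first;
            ptr = it->second;
            free_blocks.erase(it);
        } else {
            if (cudaMalloc((void**)&ptr, num_bytes) != cudaSuccess) {
                throw std::bad_alloc();
            }
            ++num_cuda_mallocs;
        }
        used_blocks[ptr] = num_bytes;
        return ptr;
    }

    // Called by Thrust when it is done with a block: cache it, do not free it.
    void deallocate(char* ptr, std::size_t) {
        auto it = used_blocks.find(ptr);
        if (it == used_blocks.end()) return;
        free_blocks.insert(std::make_pair(it->second, ptr));
        used_blocks.erase(it);
    }

    // Release everything back to the CUDA runtime at the end.
    ~CachedAllocator() {
        for (auto& block : free_blocks) cudaFree(block.second);
        for (auto& block : used_blocks) cudaFree(block.first);
    }
};

// Hypothetical helper: sorts d_arr in place and returns its mean, with
// Thrust's temporary storage served by the caching allocator.
float sortAndMean(CachedAllocator& alloc, float* d_arr, int n) {
    thrust::sort(thrust::cuda::par(alloc), d_arr, d_arr + n);
    return thrust::reduce(thrust::cuda::par(alloc), d_arr, d_arr + n) / n;
}
```

If one CachedAllocator instance is created before cudaProfilerStart() and reused for every call, the cudaMalloc calls should occur only on the first pass. Note that thrust::reduce still has to copy its result back to the host, so that call remains synchronizing even with a cached allocator.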

Alternatively, you could explore cub, which already has the allocation and processing steps separated for you.
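To make the separation of allocation and processing concrete, here is a minimal sketch using cub::DeviceRadixSort::SortKeys and cub::DeviceReduce::Sum. The struct SortReduceScratch and the functions prepareScratch and sortAndSum are hypothetical names introduced for illustration. CUB device-wide algorithms are called twice: a first call with a null temporary storage pointer only reports the number of bytes needed, and a second call does the work using storage supplied by the caller.

```cuda
#include <cstddef>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

// Hypothetical holder for scratch memory that is allocated once and reused.
struct SortReduceScratch {
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
};

// Query phase: ask CUB how much temporary storage a sort and a sum over
// n floats need, then allocate the larger of the two a single time.
cudaError_t prepareScratch(SortReduceScratch& s, int n) {
    size_t sort_bytes = 0;
    size_t reduce_bytes = 0;
    float* no_data = nullptr;  // only the types matter in the query calls
    cub::DeviceRadixSort::SortKeys(nullptr, sort_bytes, no_data, no_data, n);
    cub::DeviceReduce::Sum(nullptr, reduce_bytes, no_data, no_data, n);
    s.temp_bytes = sort_bytes > reduce_bytes ? sort_bytes : reduce_bytes;
    return cudaMalloc(&s.d_temp, s.temp_bytes);
}

// Processing phase: no allocation happens here. The sorted keys are written
// to d_sorted and their sum to d_sum, both of which live in device memory.
cudaError_t sortAndSum(SortReduceScratch& s, const float* d_in,
                       float* d_sorted, float* d_sum, int n) {
    size_t bytes = s.temp_bytes;
    cudaError_t err =
        cub::DeviceRadixSort::SortKeys(s.d_temp, bytes, d_in, d_sorted, n);
    if (err != cudaSuccess) return err;
    bytes = s.temp_bytes;
    return cub::DeviceReduce::Sum(s.d_temp, bytes, d_sorted, d_sum, n);
}
```

With this layout, prepareScratch would be called once during setup and sortAndSum as often as needed. Because the sum stays in device memory, the host decides when to copy it back, for example after several sorts and reductions have been queued.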

Answer 1 (2016-11-25 18:34:00)