CUDA kernel execution delayed by CPU code

Question

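I can't make sense of the following code, which is admittedly very simple. It is a stripped-down version of a more complex project that I have now spent a lot of time on. On my system this code runs in about 2000 ms. But when I enable the line that puts the CPU to sleep for 500 ms, the program runs longer, about 2500 ms.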

How does this square with the statement that CUDA kernels execute asynchronously with respect to the host?

Running CUDA 11.1 on Visual Studio 2019.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <chrono>
#include <iostream>
#include <numeric>
#include <thread>

__global__ void kernel(double* val, int siz) {
    for (int i = 0; i < siz; i++) val[i] = sqrt(val[i]); //calculate square root for every value in array
}

int main() {
    auto t1 = std::chrono::high_resolution_clock::now();

    const int siz = 1'000'000; //array length
    double* val = new double[siz];
    std::iota(val, val + siz, 0.0); //fill array with 0, 1, 2,...
    double* d_val;

    cudaMalloc(&d_val, sizeof(double) * siz);
    cudaMemcpy(d_val, val, sizeof(double) * siz, cudaMemcpyDefault);
    kernel <<<1, 1 >>> (d_val, siz); //start kernel
    //std::this_thread::sleep_for(std::chrono::milliseconds(500)); //---- putting cpu to sleep also delays kernel execution?
    cudaError_t err = cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();

    std::cout << "status: " << cudaGetErrorString(err) << std::endl;
    std::chrono::duration<double, std::milli> ms = t2 - t1;
    std::cout << "duration: " << ms.count() << std::endl;

    cudaFree(d_val); //release the device buffer
    delete[] val;
}

Answer 1

How does this square with the statement that CUDA kernels execute asynchronously with respect to the host?

You are experiencing WDDM command batching, as described here.

In a nutshell: on Windows, when the GPU is in the WDDM driver model, GPU commands (e.g. anything from the CUDA runtime API, plus kernel launches) are sent to a command queue. Every so often, according to an unpublished heuristic and with no explicit user controls provided, the command queue is "flushed", i.e. sent to the GPU, at which point the GPU (if it is not currently busy) begins processing those commands.

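So, in a WDDM setting, dispatching a kernel to the command queue is non-blocking (control returns immediately to the CPU thread). Dispatching work from the command queue to the GPU follows some other heuristic. (In any event, kernel execution is asynchronous with respect to the host thread.)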
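One way to check this explanation against the question's numbers is to time the same launch from both sides. The sketch below is an illustration under assumptions, not part of the original answer: `timeLaunch` and `LaunchTiming` are hypothetical names, and it reuses the question's kernel. It measures host time spent inside the launch call, GPU time between two CUDA events recorded around the kernel, and host time until `cudaDeviceSynchronize` returns. If the work sits in the WDDM queue during the sleep, one would expect `launchCallMs` to stay small and `gpuMs` to stay close to the kernel's own run time, while `wallMs` grows by roughly the sleep. The event records travel through the same command queue as the kernel, so treat the numbers as indicative only.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <numeric>
#include <thread>
#include <vector>

// Same single-threaded kernel as in the question.
__global__ void kernel(double* val, int siz) {
    for (int i = 0; i < siz; i++) val[i] = sqrt(val[i]);
}

// Hypothetical result type for the timing sketch.
struct LaunchTiming {
    double launchCallMs;  // host time spent issuing the event records and the launch
    float  gpuMs;         // GPU time between the events recorded around the kernel
    double wallMs;        // host time from the launch until cudaDeviceSynchronize returns
};

// Hypothetical helper: launches the kernel once, optionally sleeps on the host,
// then synchronizes. Fills *out only when everything succeeded.
cudaError_t timeLaunch(int siz, int sleepMs, LaunchTiming* out) {
    using Clock = std::chrono::steady_clock;
    std::vector<double> host(siz);
    std::iota(host.begin(), host.end(), 0.0);  // 0, 1, 2, ...

    double* d_val = nullptr;
    cudaError_t err = cudaMalloc(&d_val, sizeof(double) * siz);
    if (err != cudaSuccess) return err;
    err = cudaMemcpy(d_val, host.data(), sizeof(double) * siz, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) { cudaFree(d_val); return err; }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    auto t0 = Clock::now();
    cudaEventRecord(start);            // default stream
    kernel<<<1, 1>>>(d_val, siz);
    cudaEventRecord(stop);
    auto t1 = Clock::now();            // calls have returned; the kernel may not have started

    std::this_thread::sleep_for(std::chrono::milliseconds(sleepMs));
    err = cudaDeviceSynchronize();     // also reports errors from the kernel
    auto t2 = Clock::now();

    if (err == cudaSuccess) {
        cudaEventElapsedTime(&out->gpuMs, start, stop);
        out->launchCallMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
        out->wallMs = std::chrono::duration<double, std::milli>(t2 - t0).count();
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_val);
    return err;
}
```

For example, calling `timeLaunch(1'000'000, 0, &t)` and then `timeLaunch(1'000'000, 500, &t)` and comparing the fields from each call helps separate "the kernel got slower" from "the kernel started later". The allocation and copy happen before `t0`, so one-time context setup is not counted in the timings.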

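If this is a problem, you have at least a few options:

On Windows, switch to a GPU that is in the TCC driver model.
On Windows, try one of the "hacks" described in the linked answer.
Switch to Linux.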
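The first two options can be checked or tried from code. The sketch below is an illustration under assumptions: `usesTccDriver` and `launchAndFlush` are hypothetical helper names. `usesTccDriver` reads the real `cudaDeviceProp::tccDriver` field, which is 1 when the device runs under TCC. `launchAndFlush` calls `cudaStreamQuery` right after the launch; issuing such a query is a commonly reported way to get pending WDDM work submitted early, but that side effect is not documented by NVIDIA, so treat it as an assumption to verify by timing your own program.

```cuda
#include <cuda_runtime.h>

// Same single-threaded kernel as in the question.
__global__ void kernel(double* val, int siz) {
    for (int i = 0; i < siz; i++) val[i] = sqrt(val[i]);
}

// Hypothetical helper: true if device `dev` reports the TCC driver model.
bool usesTccDriver(int dev) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return false;
    return prop.tccDriver == 1;
}

// Hypothetical helper: launch, then query the stream. cudaStreamQuery returns
// cudaSuccess if the stream is idle or cudaErrorNotReady if work is pending;
// both are expected here, so only other codes are reported as errors.
// That the query makes the WDDM queue flush is an assumption, not documented behavior.
cudaError_t launchAndFlush(double* d_val, int siz, cudaStream_t stream) {
    kernel<<<1, 1, 0, stream>>>(d_val, siz);
    cudaError_t err = cudaGetLastError();  // catches launch configuration errors
    if (err != cudaSuccess) return err;
    err = cudaStreamQuery(stream);
    return (err == cudaErrorNotReady) ? cudaSuccess : err;
}
```

In the question's program, `launchAndFlush(d_val, siz, 0)` would replace the plain `kernel <<<1, 1 >>> (d_val, siz);` line, leaving the sleep and `cudaDeviceSynchronize` in place. If `usesTccDriver(0)` returns true, the extra query should not be needed.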

Solution 1
Score 5, accepted, 2021-04-05 19:59:38
