我的 GPU 加速 opencv 代碼比普通 opencv 慢

Question

我從《Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA》一書中復制了兩個例子來比較 CPU 和 GPU 的性能。

第一個代碼：

    cv::Mat src = cv::imread("D:/Pics/Pen.jpg", 0); // Pen.jpg is a 4096 * 4096 GrayScacle picture.
    cv::Mat result_host1, result_host2, result_host3, result_host4, result_host5;

    //Get initial time in miliseconds
    int64 work_begin = getTickCount();
    cv::threshold(src, result_host1, 128.0, 255.0, cv::THRESH_BINARY);
    cv::threshold(src, result_host2, 128.0, 255.0, cv::THRESH_BINARY_INV);
    cv::threshold(src, result_host3, 128.0, 255.0, cv::THRESH_TRUNC);
    cv::threshold(src, result_host4, 128.0, 255.0, cv::THRESH_TOZERO);
    cv::threshold(src, result_host5, 128.0, 255.0, cv::THRESH_TOZERO_INV);

    //Get time after work has finished     
    int64 delta = getTickCount() - work_begin;
    //Frequency of timer
    double freq = getTickFrequency();
    double work_fps = freq / delta;
    std::cout << "Performance of Thresholding on CPU: " << std::endl;
    std::cout << "Time: " << (1 / work_fps) << std::endl;
    std::cout << "FPS: " << work_fps << std::endl;
    return 0;

第二個代碼：

    cv::Mat h_img1 = cv::imread("D:/Pics/Pen.jpg", 0);  // Pen.jpg is a 4096 * 4096 GrayScacle picture.
    cv::cuda::GpuMat d_result1, d_result2, d_result3, d_result4, d_result5, d_img1;
    //Measure initial time ticks
    int64 work_begin = getTickCount();
    d_img1.upload(h_img1);
    cv::cuda::threshold(d_img1, d_result1, 128.0, 255.0, cv::THRESH_BINARY);
    cv::cuda::threshold(d_img1, d_result2, 128.0, 255.0, cv::THRESH_BINARY_INV);
    cv::cuda::threshold(d_img1, d_result3, 128.0, 255.0, cv::THRESH_TRUNC);
    cv::cuda::threshold(d_img1, d_result4, 128.0, 255.0, cv::THRESH_TOZERO);
    cv::cuda::threshold(d_img1, d_result5, 128.0, 255.0, cv::THRESH_TOZERO_INV);

    cv::Mat h_result1, h_result2, h_result3, h_result4, h_result5;
    d_result1.download(h_result1);
    d_result2.download(h_result2);
    d_result3.download(h_result3);
    d_result4.download(h_result4);
    d_result5.download(h_result5);
    //Measure difference in time ticks
    int64 delta = getTickCount() - work_begin;
    double freq = getTickFrequency();
    //Measure frames per second
    double work_fps = freq / delta;
    std::cout << "Performance of Thresholding on GPU: " << std::endl;
    std::cout << "Time: " << (1 / work_fps) << std::endl;
    std::cout << "FPS: " << work_fps << std::endl;
    return 0;

一切正常，除了：

“GPU 的速度低於 CPU”

第一個結果：

    Performance of Thresholding on CPU:
    Time: 0.0475497 
    FPS: 21.0306

第二個結果：

    Performance of Thresholding on GPU:
    Time: 0.599032
    FPS: 1.66936

然后，我決定取消上傳和下載時間：

第三個代碼：

    cv::Mat h_img1 = cv::imread("D:/Pics/Pen.jpg", 0);  // Pen.jpg is a 4096 * 4096 GrayScacle picture.
    cv::cuda::GpuMat d_result1, d_result2, d_result3, d_result4, d_result5, d_img1;
    d_img1.upload(h_img1);
    //Measure initial time ticks
    int64 work_begin = getTickCount();
    cv::cuda::threshold(d_img1, d_result1, 128.0, 255.0, cv::THRESH_BINARY);
    cv::cuda::threshold(d_img1, d_result2, 128.0, 255.0, cv::THRESH_BINARY_INV);
    cv::cuda::threshold(d_img1, d_result3, 128.0, 255.0, cv::THRESH_TRUNC);
    cv::cuda::threshold(d_img1, d_result4, 128.0, 255.0, cv::THRESH_TOZERO);
    cv::cuda::threshold(d_img1, d_result5, 128.0, 255.0, cv::THRESH_TOZERO_INV);
    //Measure difference in time ticks
    int64 delta = getTickCount() - work_begin;
    double freq = getTickFrequency();
    //Measure frames per second
    double work_fps = freq / delta;
    std::cout << "Performance of Thresholding on GPU: " << std::endl;
    std::cout << "Time: " << (1 / work_fps) << std::endl;
    std::cout << "FPS: " << work_fps << std::endl;

    cv::Mat h_result1, h_result2, h_result3, h_result4, h_result5;
    d_result1.download(h_result1);
    d_result2.download(h_result2);
    d_result3.download(h_result3);
    d_result4.download(h_result4);
    d_result5.download(h_result5);
    return 0;

但是，問題一直存在：

第三個結果：

Performance of Thresholding on GPU: 
Time: 0.136095
FPS: 7.34779

我對這個問題感到困惑。

         1st         2nd         3rd
         CPU         GPU         GPU
Time: 0.0475497   0.599032    0.136095
FPS:  21.0306     1.66936     7.34779

請幫我。

GPU規格：

*********************************************************
NVIDIA Quadro K2100M

Micro architecture: Kepler

Compute capability version: 3.0

CUDA Version: 10.1
*********************************************************

我的系統規格：

*********************************************************
laptop hp ZBook

CPU: Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz 2.90 GHZ

RAM: 8.00 GB

OS: Windows 7, 64-bit, Ultimate, Service Pack 1
*********************************************************

Answer 1

即使沒有內存操作，我能想到 CPU 版本更快的兩個原因：

1.在第 2 和第 3 代碼版本中，您聲明了結果 GpuMats 但實際上並未對其進行初始化，結果 GpuMats 的初始化將通過調用 GpuMat.create 在閾值方法內進行，這會導致 80MB 的 GPU 內存每次執行的分配，您可以通過初始化結果 GpuMats 一次然后重用它們來看到“性能改進”。 使用原始的第三個代碼，我得到以下結果（Geforce RTX 2080）：

時間： 0.010208幀率： 97.9624

當我將代碼更改為：

...
d_resut1.create(h_img1.size(), CV_8UC1);
d_result2.create(h_img1.size(), CV_8UC1);
d_result3.create(h_img1.size(), CV_8UC1);
d_result4.create(h_img1.size(), CV_8UC1);
d_result5.create(h_img1.size(), CV_8UC1);
d_img1.upload(h_img1);
//Measure initial time ticks
int64 work_begin = getTickCount();
cv::cuda::threshold(d_img1, d_result1, 128.0, 255.0, cv::THRESH_BINARY);
cv::cuda::threshold(d_img1, d_result2, 128.0, 255.0, cv::THRESH_BINARY_INV);
cv::cuda::threshold(d_img1, d_result3, 128.0, 255.0, cv::THRESH_TRUNC);
cv::cuda::threshold(d_img1, d_result4, 128.0, 255.0, cv::THRESH_TOZERO);
cv::cuda::threshold(d_img1, d_result5, 128.0, 255.0, cv::THRESH_TOZERO_INV);
...

我得到以下結果（好 2倍）時間： 0.00503374 FPS： 198.659

雖然 GpuMat 結果預分配帶來了重大的性能提升，但對 CPU 版本的相同修改卻沒有。

2. K2100M 不是一個非常強大的 GPU（576 核 @ 665 MHz），並且考慮到 OpenCV 可能（取決於你如何編譯它）在 CPU 引擎蓋下使用帶有 SIMD 指令的多線程（2.90GHz 與8個虛擬核）版本結果並不出人意料

編輯：通過使用 NVIDIA Nsight 系統分析應用程序，您可以更好地了解 GPU 內存操作懲罰：

如您所見，僅分配和釋放內存需要 10.5 毫秒，而閾值處理本身僅需要 5 毫秒

我的 GPU 加速 opencv 代碼比普通 opencv 慢

問題描述

1 個解決方案

解決方案1
6 2020-02-08 10:58:12

我的 GPU 加速 opencv 代碼比普通 opencv 慢

問題描述

1 個解決方案

解決方案1 6 2020-02-08 10:58:12

解決方案1
6 2020-02-08 10:58:12