Why is OpenCV GPU code slower than CPU code?

Question

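I'm using OpenCV 2.4.2 + VS2010 on a notebook.
I tried to run some simple tests of the GPU module in OpenCV, but they showed the GPU code to be 100 times slower than the CPU code. In this code I just convert a color image to a grayscale image using the cvtColor function.

Here is my code. PART1 is the CPU code (CPU RGB2GRAY test), PART2 uploads the image to the GPU, PART3 is the GPU RGB2GRAY, and PART4 is the CPU RGB2GRAY again. Three things puzzle me:

1. In my code, part1 takes 0.3 ms, while part4 (which is exactly the same as part1) takes 40 ms!
2. Part2, which uploads the image to the GPU, takes 6000 ms!
3. Part3 (the GPU code) takes 11 ms, which is far too slow for such a simple image!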

    #include "StdAfx.h"
    #include <iostream>
    #include "opencv2/opencv.hpp"
    #include "opencv2/gpu/gpu.hpp"
    #include "opencv2/gpu/gpumat.hpp"
    #include "opencv2/core/core.hpp"
    #include "opencv2/highgui/highgui.hpp"
    #include <cuda.h>
    #include <cuda_runtime_api.h>
    #include <ctime>
    #include <windows.h>

    using namespace std;
    using namespace cv;
    using namespace cv::gpu;

    int main()
    {
        LARGE_INTEGER freq;
        LONGLONG QPart1,QPart6;
        double dfMinus, dfFreq, dfTim;
        QueryPerformanceFrequency(&freq);
        dfFreq = (double)freq.QuadPart;

        cout<<getCudaEnabledDeviceCount()<<endl;
        Mat img_src = imread("d:\\CUDA\\train.png", 1);

        // PART1 CPU code~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        // From color image to grayscale image.
        QueryPerformanceCounter(&freq);
        QPart1 = freq.QuadPart;
        Mat img_gray;
        cvtColor(img_src,img_gray,CV_BGR2GRAY);
        QueryPerformanceCounter(&freq);
        QPart6 = freq.QuadPart;
        dfMinus = (double)(QPart6 - QPart1);
        dfTim = 1000 * dfMinus / dfFreq;
        printf("CPU RGB2GRAY running time is %.2f ms\n\n",dfTim);

        // PART2 GPU upload image~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        GpuMat gimg_src;
        QueryPerformanceCounter(&freq);
        QPart1 = freq.QuadPart;
        gimg_src.upload(img_src);
        QueryPerformanceCounter(&freq);
        QPart6 = freq.QuadPart;
        dfMinus = (double)(QPart6 - QPart1);
        dfTim = 1000 * dfMinus / dfFreq;
        printf("Read image running time is %.2f ms\n\n",dfTim);

        GpuMat dst1;
        QueryPerformanceCounter(&freq);
        QPart1 = freq.QuadPart;

        /*dst.upload(src_host);*/
        dst1.upload(imread("d:\\CUDA\\train.png", 1));

        QueryPerformanceCounter(&freq);
        QPart6 = freq.QuadPart;
        dfMinus = (double)(QPart6 - QPart1);
        dfTim = 1000 * dfMinus / dfFreq;
        printf("Read image running time 2 is %.2f ms\n\n",dfTim);

        // PART3~ GPU code~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        // gpuimage From color image to grayscale image.
        QueryPerformanceCounter(&freq);
        QPart1 = freq.QuadPart;

        GpuMat gimg_gray;
        gpu::cvtColor(gimg_src,gimg_gray,CV_BGR2GRAY);

        QueryPerformanceCounter(&freq);
        QPart6 = freq.QuadPart;
        dfMinus = (double)(QPart6 - QPart1);
        dfTim = 1000 * dfMinus / dfFreq;
        printf("GPU RGB2GRAY running time is %.2f ms\n\n",dfTim);

        // PART4~CPU code(again)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

        // Same CPU conversion as PART1: color image to grayscale image.
        QueryPerformanceCounter(&freq);
        QPart1 = freq.QuadPart;
        Mat img_gray2;
        cvtColor(img_src,img_gray2,CV_BGR2GRAY);
        BOOL i_test=QueryPerformanceCounter(&freq);
        printf("%d \n",i_test);
        QPart6 = freq.QuadPart;
        dfMinus = (double)(QPart6 - QPart1);
        dfTim = 1000 * dfMinus / dfFreq;
        printf("CPU RGB2GRAY running time is %.2f ms\n\n",dfTim);

        cvWaitKey();
        getchar();
        return 0;
    }

Answer 1

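cvtColor isn't doing very much work; to make gray, all you have to do is average three numbers.

The cvtColor code on the CPU uses SSE2 instructions to process up to 8 pixels at once, and if you have TBB it uses all the cores/hyperthreads. The CPU is running at 10x the clock speed of the GPU, and finally you don't have to copy the data onto the GPU and back.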
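To make the "three numbers per pixel" point concrete, here is a minimal scalar sketch of a BGR-to-gray conversion. It is not OpenCV's implementation: the name `bgrToGray` and the 14-bit fixed-point scaling are assumptions made for illustration, and it uses the usual luma weights (Y = 0.299 R + 0.587 G + 0.114 B) rather than a plain mean.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of OpenCV: converts interleaved 8-bit BGR
// pixels to 8-bit gray with one weighted sum per pixel.
std::vector<std::uint8_t> bgrToGray(const std::vector<std::uint8_t>& bgr)
{
    assert(bgr.size() % 3 == 0);  // input must hold whole B,G,R triples

    // Weights 0.114, 0.587, 0.299 scaled by 2^14 (illustrative fixed point).
    const int kShift = 14;
    const int kB = 1868, kG = 9617, kR = 4899;  // the three sum to 16384

    std::vector<std::uint8_t> gray(bgr.size() / 3);
    for (std::size_t i = 0; i < gray.size(); ++i) {
        const int b = bgr[3 * i];
        const int g = bgr[3 * i + 1];
        const int r = bgr[3 * i + 2];
        // Add half of the scale before shifting so the result is rounded.
        const int y = (b * kB + g * kG + r * kR + (1 << (kShift - 1))) >> kShift;
        gray[i] = static_cast<std::uint8_t>(y);
    }
    return gray;
}
```

Each output pixel costs three multiplies, two adds and a shift, which is why the arithmetic alone gives the GPU very little to win back.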

Answer 2

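Most of the answers above are actually wrong. The reason it is slower by a factor of 20,000 is of course not that "the CPU clock speed is faster" and "it has to copy the data to the GPU" (the accepted answer). Those are factors, but citing them alone omits the fact that the GPU has vastly more computing power for a problem that is embarrassingly parallel. Attributing a 20,000x performance difference to those factors is plainly ridiculous. The author knew something was wrong, and the cause is not obvious. Solution:

Your problem is that CUDA needs to initialize! It always initializes on the first image, and that generally takes between 1 and 10 seconds, depending on the alignment of Jupiter and Mars. Now try this: do the computation twice and time both runs. You will probably see that the speeds are then within the same order of magnitude, not 20,000x apart. Can you do anything about this initialization? Not that I know of. It's a snag.

Edit: I just re-read the post. You say you are running on a notebook. Those often have weak GPUs, and CPUs with a decent turbo boost.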
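The "compute twice and time both" advice can be sketched without any GPU at all. In the example below, `LazyWorker` is a hypothetical stand-in for a library with a one-time setup cost (simulated with a 50 ms sleep), and `timeRuns` is an assumed helper name; neither comes from OpenCV or CUDA.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical stand-in for a GPU library: the first call pays a one-time
// setup cost (simulated with a sleep), later calls only do the work.
struct LazyWorker {
    int initCount = 0;
    long long sink = 0;  // accumulates a result so the loop is not dropped

    void run()
    {
        if (initCount == 0) {
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            ++initCount;
        }
        for (int i = 0; i < 1000; ++i) sink += i;
    }
};

// Calls f `runs` times and returns the wall-clock time of each call in ms.
template <typename F>
std::vector<double> timeRuns(F f, int runs)
{
    assert(runs > 0);
    std::vector<double> ms;
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        f();
        const auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}
```

Calling `timeRuns([&w] { w.run(); }, 3)` on a fresh `LazyWorker w` should show the setup cost only in the first entry. For the code in the question, the analogous change would presumably be one throwaway `upload` and `gpu::cvtColor` call before the timers start.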

Answer 3

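Try running it more than once.

----------- Excerpt from http://opencv.willowgarage.com/wiki/OpenCV%20GPU%20FAQ (Performance)

Why is the first function call slow?

That is because of initialization overhead. On the first GPU function call, the CUDA Runtime API is initialized implicitly. Some GPU code is also compiled for your video card (just-in-time compilation) on first use. So for performance measurements, it is necessary to make a dummy function call first and only then run the timing tests.

If it is critical for an application to run GPU code only once, it is possible to use a compilation cache that persists across runs. See the nvcc documentation for details (CUDA_DEVCODE_CACHE environment variable).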

Answer 4

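cvtColor is a small operation, and any performance gain from running it on the GPU is far outweighed by the memory transfer time between the host (CPU) and the device (GPU). Minimizing the latency of this memory transfer is a primary challenge in any GPU computing.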
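A rough way to reason about this trade-off is to add up upload, kernel and download time and compare the total with the CPU time. The sketch below is a back-of-the-envelope model only: `transferMs` and `gpuPaysOff` are hypothetical names, and the bandwidth and latency you pass in are assumptions you would have to measure on your own machine.

```cpp
#include <cassert>

// Estimated time in ms to move `bytes` across the bus, given an assumed
// sustained bandwidth in GB/s and a fixed per-transfer latency in ms.
double transferMs(double bytes, double gigabytesPerSec, double latencyMs)
{
    assert(gigabytesPerSec > 0.0);
    return latencyMs + bytes / (gigabytesPerSec * 1e9) * 1e3;
}

// True if upload + kernel + download together beat the CPU time.
bool gpuPaysOff(double cpuMs, double kernelMs, double uploadMs, double downloadMs)
{
    return uploadMs + kernelMs + downloadMs < cpuMs;
}
```

With a CPU time as small as the 0.3 ms reported in the question, even a transfer of about a millisecond is enough to make the GPU path lose, however fast the kernel is.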

Answer 5

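What GPU do you have?

Check its compute capability (CC); that may be the reason.

https://developer.nvidia.com/cuda-gpus

This means that for devices with CC 1.3 and 2.0, binary images are ready to run. For all newer platforms, the PTX code for 1.3 is JIT-compiled to a binary image. For devices with CC 1.1 and 1.2, the PTX for 1.1 is JIT-compiled. For devices with CC 1.0, no code is available and the functions throw an exception. On platforms where JIT compilation has to be performed first, the first run is slow.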

http://docs.opencv.org/modules/gpu/doc/introduction.html
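The rule quoted above can be written down as a small lookup. This is only an illustration of that paragraph, not code from OpenCV: `CodePath`, `codePathFor` and `firstCallPaysJit` are hypothetical names, and the mapping applies only to a build configured the way the quoted text describes.

```cpp
#include <cassert>

// Which kind of device code the quoted rule selects for a given device.
enum class CodePath { Binary, JitFromPtx13, JitFromPtx11, Unsupported };

// Maps a compute capability (major.minor) to the code path described in
// the quoted documentation paragraph.
CodePath codePathFor(int major, int minor)
{
    assert(major >= 1 && minor >= 0 && minor <= 9);
    const int cc = major * 10 + minor;             // e.g. 2.1 becomes 21
    if (cc == 13 || cc == 20) return CodePath::Binary;
    if (cc > 20) return CodePath::JitFromPtx13;    // newer than 2.0
    if (cc == 11 || cc == 12) return CodePath::JitFromPtx11;
    return CodePath::Unsupported;                  // CC 1.0: no code available
}

// True when the first call has to JIT-compile PTX and is therefore slow.
bool firstCallPaysJit(CodePath path)
{
    return path == CodePath::JitFromPtx13 || path == CodePath::JitFromPtx11;
}
```

Under this rule, a device reporting CC 2.1 or 3.0 would go through JIT compilation on first use, which fits the multi-second first call described in the question.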

Answer 1: score 25, accepted, 2012-08-22 13:55:37
Answer 2: score 22, 2015-06-07 12:40:16
Answer 3: score 6, 2012-09-24 05:36:31
Answer 4: score 1, 2013-06-12 03:17:59
Answer 5: score 0, 2013-04-16 13:22:52
