
OpenCV CUDA Morphology is much slower than on CPU

I am processing 2208x1242 images from a video in a while-loop, using C++ with OpenCV.
To speed things up, I wanted to run the operations on the GPU of my Nvidia Jetson Nano.
For the color conversion from BGR to HSV, using cv::cuda::cvtColor instead of cv::cvtColor gives me a speedup by a factor of 5.

Unfortunately, morphological operations are much slower on the GPU:

int num_frame = 10;
int frame = 0;

cv::Mat img;
cv::cuda::GpuMat img_gpu;

cv::Mat open_kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(11, 11));
cv::Mat close_kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(21, 21));

while (frame < num_frame){

  // load image to img
  // ...

  img_gpu.upload(img);

  cv::Ptr<cv::cuda::Filter> morph_filter_open = cv::cuda::createMorphologyFilter(cv::MORPH_OPEN, img_gpu.type(), open_kernel);
  cv::Ptr<cv::cuda::Filter> morph_filter_close = cv::cuda::createMorphologyFilter(cv::MORPH_CLOSE, img_gpu.type(), close_kernel);

  morph_filter_open->apply(img_gpu, img_gpu);
  morph_filter_close->apply(img_gpu, img_gpu);

  frame++;
}

Measuring only the apply() calls, the GPU version is about 20x slower than cv::morphologyEx on the CPU of the Jetson Nano (about 1.5 s on the GPU vs. 0.07 s on the CPU for a single frame).
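A measurement like the one above can be sketched as follows. This is only a minimal illustration, not the code used for the numbers in the question: `mean_seconds` and `timing_demo` are hypothetical names, and the sleeping lambda is a stand-in for the two apply() calls. If the GPU work is queued on a non-default cv::cuda::Stream, the timed callable would also need to call waitForCompletion() on that stream, otherwise only the launch overhead is measured.

```cpp
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// Hypothetical helper: runs `work` `repeats` times and returns the mean
// wall-clock time per call in seconds, using a monotonic clock.
double mean_seconds(const std::function<void()>& work, int repeats) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < repeats; ++i) {
        work();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / repeats;
}

// Example usage with a stand-in workload. In the real program the lambda
// would contain only the two apply() calls, so that upload and filter
// creation are not included in the measurement.
void timing_demo() {
    auto fake_apply = [] {
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    };

    mean_seconds(fake_apply, 1);  // warm-up run, result discarded
    const double per_call = mean_seconds(fake_apply, 10);
    std::cout << "mean time per call: " << per_call << " s\n";
}
```

Discarding a warm-up run matters on the GPU in particular, because the first CUDA call in a process typically also pays for context creation and kernel loading.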

nvprof shows that most of the time is spent in cudaDeviceSynchronize (this profile covers the whole program, which does more than the code sample above, but the long-running operations are probably related to the morphology):

  API calls:   71.05%  17.2756s       665  25.978ms  25.730us  1.44814s  cudaDeviceSynchronize
                8.36%  2.03194s      1826  1.1128ms  34.844us  847.66ms  cudaLaunchKernel
                5.16%  1.25490s         1  1.25490s  1.25490s  1.25490s  cuCtxDestroy
                4.80%  1.16684s       544  2.1449ms  17.865us  10.378ms  cudaMallocPitch
                1.89%  460.14ms       616  746.98us  20.469us  346.82ms  cudaFree
                1.65%  401.38ms        76  5.2813ms  44.533us  19.211ms  cudaMemcpy2D
                1.45%  352.97ms        51  6.9209ms  18.803us  242.14ms  cudaMalloc
                1.42%  345.25ms         1  345.25ms  345.25ms  345.25ms  cudaFuncGetAttributes
                1.23%  299.95ms         1  299.95ms  299.95ms  299.95ms  cuCtxCreate
                1.03%  251.43ms        20  12.572ms  162.61us  103.74ms  cudaMallocManaged
                0.92%  224.67ms        13  17.283ms  32.553us  65.173ms  cudaMemcpy
                0.56%  135.48ms         1  135.48ms  135.48ms  135.48ms  cudaDeviceReset
...

I hope someone can help me figure out what the problem is!

I had the same problem and managed to improve the performance of the CUDA-based morphology by some margin. Instead of creating the morphology filter objects inside the loop, I moved their creation outside the image capture loop.

So the code should look like this:

int num_frame = 10;
int frame = 0;

cv::Mat img;
cv::cuda::GpuMat img_gpu;

cv::Mat open_kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(11, 11));
cv::Mat close_kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(21, 21));

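// Morphology filter object creation outside the loop.
// Note: img_gpu is still empty at this point, so img_gpu.type() evaluates to CV_8UC1;
// pass the actual frame type explicitly if it differs.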
cv::Ptr<cv::cuda::Filter> morph_filter_open = cv::cuda::createMorphologyFilter(cv::MORPH_OPEN, img_gpu.type(), open_kernel);
cv::Ptr<cv::cuda::Filter> morph_filter_close = cv::cuda::createMorphologyFilter(cv::MORPH_CLOSE, img_gpu.type(), close_kernel);

while (frame < num_frame){

  // load image to img
  // ...

  img_gpu.upload(img);

  morph_filter_open->apply(img_gpu, img_gpu);
  morph_filter_close->apply(img_gpu, img_gpu);

  frame++;
}

I couldn't find any way to improve the performance of the CUDA morphology filter beyond this.

