
GPU with OpenCL is slower than CPU. Why?

Environment:

  • Intel i7-9750H
  • Intel UHD Graphics 630
  • Nvidia GTX 1050 (Laptop)
  • Visual Studio 2019 / C++
  • OpenCV 4.4
  • OpenCL 3.0 (Intel) / 1.2 (Nvidia)

I'm trying to use OpenCL to speed up my code, but the results show that the CPU is faster than the GPU. How can I speed up my code?

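#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>

void GetHoughLines(cv::Mat dst) {
    cv::ocl::setUseOpenCL(true);  // enable OpenCV's OpenCL (T-API) path

    int img_w = dst.size().width; // 5000
    int img_h = dst.size().height; // 4000

    // Wrap the input as a UMat and create an 8-bit accumulator of the same size.
    cv::UMat tmp_dst = dst.getUMat(cv::ACCESS_READ);
    cv::UMat tmp_mat = cv::UMat(dst.size(), CV_8UC1, cv::Scalar(0));

    // Element-wise multiply, repeated 1000 times.
    for (size_t i = 0; i < 1000; i++)
    {
        tmp_mat = tmp_mat.mul(tmp_dst);
    }
}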

It took about 3000 ms when I used only the CPU. With the Intel UHD Graphics 630 it took 3500 ms. I also tried the GTX 1050, but that took about 3000 ms as well.

Please give me some ideas to speed it up. I need to get it down to 1000 ms or less. Should I use AMP or OpenMP? As far as I know, they can only handle simple operations and are not suitable for OpenCV functions.

Basically, your code is slow because the way OpenCV uses OpenCL is inefficient. It has nothing to do with the underlying hardware.

For OpenCL code (or any GPU-related code, for that matter) to be efficient, it is crucial that the host-side code utilizes the GPU properly. To name a few principles:

  • Saturate the GPU by asynchronously enqueuing many computations (kernels).
  • Avoid unnecessary synchronizations.
  • Avoid unnecessary memory copies between the host CPU and the GPU device.

Even if you write the most optimized GPU kernels, you are very unlikely to gain any performance if you fail to adhere to these basics.

The OpenCV codebase is a great example of how not to adhere to these principles.

As for your example, if you rewrite your code to avoid memory copies and to use device memory explicitly, you might see reasonable performance:

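// `size` and `format` stand for the image size and element type
// (for the data in the question, presumably dst.size() and CV_8UC1).
auto frame1 = cv::UMat(size, format, cv::USAGE_ALLOCATE_DEVICE_MEMORY);
auto frame2 = cv::UMat(size, format, cv::USAGE_ALLOCATE_DEVICE_MEMORY);
auto frame3 = cv::UMat(size, format, cv::USAGE_ALLOCATE_DEVICE_MEMORY);

// Inputs and output all stay in device memory across iterations.
for (size_t i = 0; i < 10; i++)
{
    cv::multiply(frame1, frame2, frame3);
}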

In any case, I recommend learning to use the OpenCL API directly, without OpenCV.
