
OpenCL algorithm speedup

Original post: 2020-12-01 15:37:36 | Tags: c, parallel-processing, opencl

I wrote a simple OpenCL algorithm that uses my GPU to apply a filter to an image. Everything works fine, so I decided to write a C version of the algorithm that does the same task on a single core, to compare execution speeds. I ran each version 1000 times: the OpenCL version averages 1 ms per run, whereas the serial version averages 36 ms. That is a huge difference, so I was wondering whether such an improvement is plausible.
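For readers who want to reproduce this kind of comparison, here is a minimal sketch of the serial side in C. The post does not say which filter was used, so `box_blur_3x3` is a hypothetical stand-in (a 3x3 box blur on an 8-bit grayscale image), and `average_ms` is an assumed helper name. It uses the C11 `timespec_get` wall clock and averages over many runs, as the question describes.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the filter in the question: a 3x3 box blur on an
   8-bit grayscale image. Border pixels are copied through unchanged. */
void box_blur_3x3(const unsigned char *src, unsigned char *dst,
                  int width, int height)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            size_t idx = (size_t)y * width + x;
            if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
                dst[idx] = src[idx];
                continue;
            }
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += src[(size_t)(y + dy) * width + (x + dx)];
            dst[idx] = (unsigned char)(sum / 9);
        }
    }
}

/* Assumed helper: average wall-clock time in milliseconds over `runs` calls. */
double average_ms(const unsigned char *src, unsigned char *dst,
                  int width, int height, int runs)
{
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    for (int i = 0; i < runs; ++i)
        box_blur_3x3(src, dst, width, height);
    timespec_get(&t1, TIME_UTC);
    double elapsed_ms = (double)(t1.tv_sec - t0.tv_sec) * 1e3
                      + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
    return elapsed_ms / runs;
}
```

Calling `average_ms(src, dst, width, height, 1000)` mirrors the 1000-run average from the question. For a fair comparison, the GPU measurement should cover the same work, including any host-to-device and device-to-host copies if those are part of each run.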

1 Answer

You already answered your question: your test showed a speedup of 36x. That is not an uncommon result. When going from a single-core CPU implementation to a GPU implementation, you may see anything from no speedup at all (in your case, if the image were very small, PCIe latency would outweigh the compute speedup) all the way to about 2000x (large image, perfectly parallelizable algorithm with no communication between threads), depending on your hardware. If you want to figure out exactly how good your implementation is, do a roofline analysis.
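The two ideas in this answer can be sketched in a few lines of C. The roofline model bounds attainable throughput by min(peak compute, memory bandwidth x arithmetic intensity), and a fixed per-run overhead such as transfer latency reduces the observed speedup. All function names here are hypothetical, and the peak and bandwidth figures have to come from your own hardware's specifications; none of them are given in the post.

```c
/* Roofline bound: attainable GFLOP/s is the smaller of the compute peak and
   the memory-bound rate (bandwidth in GB/s times flops per byte moved). */
double roofline_gflops(double peak_gflops, double bandwidth_gbs,
                       double flops_per_byte)
{
    double memory_bound = bandwidth_gbs * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

/* Achieved GFLOP/s from the kernel's total flop count and measured time. */
double achieved_gflops(double total_flops, double time_ms)
{
    return total_flops / (time_ms * 1e-3) / 1e9;
}

/* Observed speedup when every GPU run also pays a fixed overhead
   (for example transfer latency) on top of the kernel time. */
double speedup_with_overhead(double cpu_ms, double gpu_kernel_ms,
                             double overhead_ms)
{
    return cpu_ms / (gpu_kernel_ms + overhead_ms);
}
```

With the timings from the question, `speedup_with_overhead(36.0, 1.0, 0.0)` gives the reported 36x. As the overhead term grows relative to the kernel time, which is what happens for very small images, the ratio falls toward 1 or below. Comparing `achieved_gflops` against `roofline_gflops` for your device indicates how close the kernel is to the hardware limit.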