
Using Cuda optimization approaches for OpenCL

The more I learn about OpenCL, the more it seems that proper kernel optimization is the key to success. Furthermore, I have noticed that the kernels for both languages look very similar.

So how sensible would it be to apply Cuda optimization strategies, learned from books and tutorials, to OpenCL kernels? Especially considering that there is so much more (good) literature for Cuda than for OpenCL.

What is your opinion on that? What is your experience?

Thanks!

If you are working only with Nvidia cards, you can use the same optimization approaches in both CUDA and OpenCL. One thing to keep in mind, though, is that OpenCL may have a longer start-up time than CUDA on Nvidia cards (this was a while ago, when I was experimenting with both of them).

However, if you are going to target different architectures, you will need to find a way to generalize your OpenCL program so that it performs well across multiple platforms, which is not possible with CUDA.

But some of the basic optimization approaches will remain the same. For example, the following holds on any platform:

  1. Reading from and writing to aligned memory addresses gives higher performance (and is sometimes mandatory, e.g. on platforms like the Cell processor).
  2. Knowing and understanding the limited resources of each platform, whether they are called constant memory, shared memory, local memory or cache.
  3. Understanding parallel programming; for example, figuring out the trade-off between performance gains (launching more threads) and overhead costs (launching, communication and synchronization).

That last part is useful in all kinds of parallel programming, be it multi-core, many-core or grid computing.

While I'm still new to OpenCL (and have barely glanced at CUDA), optimization at the developer level can be summarized as structuring your code so that it matches the hardware's (and compiler's) preferred way of doing things.

On GPUs, this can be anything from correctly ordering your data to take advantage of cache coherency (GPUs LOVE to work with cached data, from the top all the way down to the individual cores [there are several levels of cache]) to taking advantage of built-in operations like vector and matrix manipulation. I recently had to implement FDTD in OpenCL and found that by replacing the expanded dot/cross products in the popular implementations with matrix operations (which GPUs love!), reordering loops so that the X dimension (whose elements are stored sequentially) is handled in the innermost loop instead of the outermost, avoiding branching (which GPUs hate), etc., I was able to increase performance by about 20%. Those optimizations should work in CUDA, OpenCL or even GPU assembly, and I would expect that to be true of all of the most effective GPU optimizations.

Of course, most of this is application-dependent, so it may fall under the TIAS (try-it-and-see) category.

Here are a few links I found that look promising:

NVIDIA - Best Practices for OpenCL Programming

AMD - Porting CUDA to OpenCL

My research (and even NVIDIA's documentation) points to a nearly 1:1 correspondence between CUDA and OpenCL, so I would be very surprised if optimizations did not translate well between them. Most of what I have read focuses on cache coherency, avoiding branching, etc.
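As a rough illustration of that correspondence, here is the same SAXPY-style kernel written in both dialects (a sketch, not taken from any particular codebase); the address-space qualifiers and thread-indexing built-ins map almost one to one:

```c
/* CUDA */
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread id */
    if (i < n) y[i] = a * x[i] + y[i];
}

/* OpenCL C */
__kernel void saxpy(int n, float a, __global const float *x, __global float *y) {
    int i = get_global_id(0);                       /* global work-item id */
    if (i < n) y[i] = a * x[i] + y[i];
}
```

The bodies are identical; only the entry-point qualifier, the pointer address spaces, and the index computation change, which is why an optimization expressed in one usually carries straight over to the other.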

Also, note that in the case of OpenCL, the actual compilation process is handled by the vendor (I believe it happens in the video driver), so it may be worthwhile to have a look at the driver documentation and OpenCL kits from your vendor (NVIDIA, ATI, Intel(?), etc).
