Image processing on the GPU with OpenGL, GLSL and Framebuffer Objects - questions about performance

I was brought into a project that does image processing on the CPU and is currently being extended to use the GPU as well, the hope being to use mainly the GPU, if that proves to be faster, and to keep the CPU processing part as a fall-back. I am new to GPU programming and have a few questions, aspects of which I have seen discussed in other threads, but I haven't been able to find the answers I need.

  1. If we were starting from scratch, what technology would you recommend for image processing on the GPU, in order to achieve the optimum combination of coverage (as in support on client machines) and speed? We have gone down the OpenGL + GLSL route as a way of covering as many graphics cards as possible, and I am curious whether this is the optimal choice. What would you say about OpenCL, for example?

  2. Given that we have already started implementing the GPU module with OpenGL and shaders, I would like to get an idea of whether we are doing it in the most efficient way.

    We use Framebuffer Objects to read from and to render to textures. In most cases the area being read from and the area being written to are the same size, but the textures we read from and write to can be of arbitrary sizes. In other words, we ask the FBO to read a subarea of what is considered its input texture and to write to a subarea of what is considered its output texture. For that purpose the output texture is "attached" to the Framebuffer Object (with glFramebufferTexture2DEXT()), but the input one is not. This requires textures to be "attached" and "detached" as they change roles (i.e. a texture could initially be used for writing to, but in the next pass it could be used as an input to read from). A minimal sketch of this ping-pong pattern follows the question list below.

    Would it make more sense, in terms of using the FBO efficiently and achieving better performance, to force the inputs and outputs to be the same size and always keep them attached to the FBO, or does what we already do sound good enough?

  3. The project was initially designed to render on the CPU, so care was taken to render as few pixels as possible per request. So, whenever a mouse move happens, for example, only a very small area around the cursor is re-rendered. Or, when rendering a whole image that covers the screen, it might be chopped into strips to be rendered and displayed one after the other. Does such fragmentation make sense when rendering on the GPU? What would be the best way to determine the optimum size for a render request (i.e. an output texture), so that the GPU is fully utilised?

  4. What considerations would there be when profiling code that runs on the GPU (in order to compare it with rendering on the CPU)? Does measuring how long calls take to return (and calling glFinish() to ensure commands have completed on the GPU) sound useful, or is there anything else to keep in mind?
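To make the setup in question 2 concrete, here is a minimal sketch of the attach/detach pattern we use, with the EXT framebuffer entry points mentioned above. The names texA, texB, fbo, program, numPasses and drawFullscreenQuad() are hypothetical placeholders for whatever the application provides; the point is that only the current output texture is attached to the FBO, while the input is merely bound for sampling, and the two swap roles between passes.

    #include <GL/glew.h>   // or any loader exposing EXT_framebuffer_object
    #include <utility>     // std::swap

    // Hypothetical application-provided pieces:
    extern GLuint texA, texB, fbo, program;   // textures, FBO and filter shader
    extern int numPasses;
    void drawFullscreenQuad();                // draws the screen-sized quad

    void runFilterChain() {
        GLuint src = texA, dst = texB;
        glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
        for (int pass = 0; pass < numPasses; ++pass) {
            // Attach the current output texture as the colour target...
            glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                                      GL_TEXTURE_2D, dst, 0);
            // ...and bind the current input texture for sampling only.
            glActiveTexture(GL_TEXTURE0);
            glBindTexture(GL_TEXTURE_2D, src);
            glUseProgram(program);
            drawFullscreenQuad();
            std::swap(src, dst);              // this pass's output feeds the next
        }
        glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0);
    }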

Thank you very much!


I think I need to add a couple of details to clarify my questions:

2) We aren't actually using the same texture as a rendering target and a reading source at the same time. It's only when rendering has finished that an "output" texture becomes an "input" one - i.e. when the result of a render job needs to be read for another pass or as an input for another filter.

What I was concerned with was whether attached textures are treated differently - i.e. whether the FBO or the shader would have faster access to them than when they aren't attached.

My initial (though probably not totally accurate) profiling didn't show dramatic differences, so I guess we aren't committing that much of a performance crime. I'll do more tests with the timing functions you suggested - those look useful.

3) I was wondering whether chopping a picture into tiny pieces (say, as small as 100 x 100 pixels for a mouse move) and requesting them to be rendered one by one would be slower or faster (or whether it wouldn't matter) on a GPU, which could potentially parallelise a lot of the work. My gut feeling is that this might be overzealous optimisation that, in the best case, won't buy us much and, in the worst, might hurt performance, so I was wondering whether there is a formal way of telling for a particular implementation. In the end, I guess we'd go with whatever seems reasonable across various graphics cards.

I don't have too much insight into your project, but I'll try to provide some simple answers; perhaps others can be more detailed:

  1. As long as you do the usual modify-output-pixels-using-some-input-pixels tasks of image processing without much synchronization, you should be fine with the usual screen-sized-quad-with-fragment-shader approach (sorry for these strange phrases; a minimal shader sketch follows this list). And you get image filtering (like bilinear interpolation) for free (I don't know whether CUDA or OpenCL support image filtering, although they should, as the hardware is there anyway).

  2. You cannot read from a texture that is used as a render target anyway (although it may still stay attached, I think), so your current approach should be fine. Requiring the textures to be the same size only so that they can stay attached to the FBO would limit your flexibility very much for practically nothing (I think the attach cost is negligible).

  3. The optimal size is really implementation dependent, but limiting the rendered range, and therefore the number of fragment shader invocations, should always be a good idea, as long as computing those limits doesn't take too long (simple bounding boxes with glScissor are your friend, I think, or just using a smaller-than-screen-size quad).

  4. There are other, perhaps much more accurate, methods for timing the GPU (look at the GL_ARB_timer_query extension, for example; see the timing sketch after this list). For profiling and debugging you can use general GPU profilers and debuggers, such as gDEBugger and the like, I think, although I don't have much experience with such tools.
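For point 1, a per-pixel filter in this approach is just a fragment shader that samples the input texture and writes one output pixel; the bilinear interpolation mentioned above comes from the sampler's filter state, not from shader code. A minimal sketch in GLSL 1.20-era compatibility syntax, stored as a C string the way such shaders are typically embedded; the uniform name uInput and the invert operation are arbitrary illustrative choices:

    // One input sample in, one output pixel out - the canonical image filter.
    const char* kInvertFragmentShader =
        "uniform sampler2D uInput;                               \n"
        "void main() {                                           \n"
        "    // Sampling is bilinear if the texture's MIN/MAG    \n"
        "    // filters are set to GL_LINEAR.                    \n"
        "    vec4 c = texture2D(uInput, gl_TexCoord[0].st);      \n"
        "    gl_FragColor = vec4(vec3(1.0) - c.rgb, c.a);        \n"
        "}                                                       \n";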
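And for point 4, a sketch contrasting the glFinish() bracketing proposed in the question with a GL_ARB_timer_query query, which measures GPU execution time without stalling the CPU while the commands are issued. renderFilterPass() is a hypothetical stand-in for the workload being measured; note that reading the result with GL_QUERY_RESULT blocks until the GPU has finished.

    #include <GL/glew.h>
    #include <chrono>

    void renderFilterPass();   // hypothetical: the GL work being measured

    // Approach A (from the question): wall-clock time between glFinish() fences.
    double timeWithFinishMs() {
        glFinish();                                   // drain pending work first
        auto t0 = std::chrono::steady_clock::now();
        renderFilterPass();
        glFinish();                                   // wait for completion
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    // Approach B: GL_ARB_timer_query times the enclosed commands on the GPU.
    double timeWithQueryMs() {
        GLuint query;
        glGenQueries(1, &query);
        glBeginQuery(GL_TIME_ELAPSED, query);
        renderFilterPass();
        glEndQuery(GL_TIME_ELAPSED);
        GLuint64 ns = 0;
        glGetQueryObjectui64v(query, GL_QUERY_RESULT, &ns);   // blocks until ready
        glDeleteQueries(1, &query);
        return ns / 1.0e6;
    }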

EDIT: To your edited questions:

  1. I really doubt that an attached texture is read any faster than a non-attached one. The only thing you would gain is not needing to reattach it when you want to write into it, but as I said, that cost should be negligible, if any.

  2. I would not over-optimize by tiling the image into too-small pieces. Like I said, when working with GL you can use the scissor test and the stencil test for such things (see the sketch below). But it all has to be tested, I think, to be sure of the performance gain. I don't know what you mean by the mouse-move case, though: when you just move the mouse over your window, the window system usually renders the cursor as an overlay, so you don't need to redraw the underlying image at all - it is buffered by the window system, I think.
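To illustrate the scissor approach from point 2 (and answer 3 above): restricting rasterization to a small dirty rectangle is a single state change, so there is no need to split the work into many separate render requests. A sketch; the rectangle coordinates are hypothetical, drawFullscreenQuad() is the same placeholder as in the earlier sketch, and note that GL's scissor origin is the lower-left corner of the framebuffer.

    void drawFullscreenQuad();            // as in the earlier sketch

    // Redraw only a small dirty rectangle (e.g. around the mouse cursor):
    // fragments outside the scissor box are discarded.
    void redrawDirtyRect(int x, int y, int width, int height) {
        glEnable(GL_SCISSOR_TEST);
        glScissor(x, y, width, height);   // x, y = lower-left corner, in pixels
        drawFullscreenQuad();             // same quad; only the box is shaded
        glDisable(GL_SCISSOR_TEST);
    }

    // Hypothetical usage for a 100 x 100 px area around the cursor:
    //   redrawDirtyRect(cursorX - 50, cursorY - 50, 100, 100);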
