
Read pixel data from default framebuffer in OpenGL: Performance of FBO vs. PBO

My goal is to read the contents of the default OpenGL framebuffer and store the pixel data in a cv::Mat. Apparently there are two different ways of achieving this:

1) Synchronous: use FBO and glReadPixels

cv::Mat a = cv::Mat::zeros(cv::Size(1920, 1080), CV_8UC3);
glReadPixels(0, 0, 1920, 1080, GL_BGR, GL_UNSIGNED_BYTE, a.data);

2) Asynchronous: use PBO and glReadPixels

cv::Mat b = cv::Mat::zeros(cv::Size(1920, 1080), CV_8UC3);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo_userImage);
    glReadPixels(0, 0, 1920, 1080, GL_BGR, GL_UNSIGNED_BYTE, 0);
    unsigned char* ptr = static_cast<unsigned char*>(glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY));
    std::copy(ptr, ptr + 1920 * 1080 * 3 * sizeof(unsigned char), b.data);
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

From all the information I collected on this topic, the asynchronous version 2) should be much faster. However, comparing the elapsed time for both versions shows that the differences are often minimal, and sometimes version 1) even outperforms the PBO variant.

For performance checks, I've inserted the following code (based on this answer):

std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
....
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << std::endl;
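
The timing snippet above can be wrapped in a small helper so that both variants are measured in exactly the same way. This is only a minimal sketch: `measureMicroseconds` is a hypothetical name that does not appear in the original code, and it measures just the CPU-side wall time of whatever runs inside the callable, not when the GPU actually finishes.

```cpp
#include <chrono>
#include <cstdint>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in microseconds, measured with std::chrono::steady_clock.
template <typename F>
std::int64_t measureMicroseconds(F&& work) {
    const auto begin = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();
}
```

Each variant would then be passed in as a lambda, for example `measureMicroseconds([&] { /* variant 1) or 2) */ })`.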

I've also experimented with the usage hint when creating the PBO: I didn't find much of a difference between GL_DYNAMIC_COPY and GL_STREAM_READ here.

I'd be happy for suggestions on how to increase the speed of this pixel read operation from the framebuffer even further.

Your second version is not asynchronous at all, since you're mapping the buffer immediately after triggering the copy. The map call will then block until the contents of the buffer are available, effectively becoming synchronous.

Or: depending on the driver, it will block when actually reading from it. In other words, the driver may implement the mapping in such a way that it causes a page fault and a subsequent synchronization. It doesn't really matter in your case, since you are still accessing that data straight away due to the std::copy.

The proper way of doing this is by using sync objects and fences.

Keep your PBO setup, but after issuing the glReadPixels into a PBO, insert a sync object into the stream via glFenceSync. Then, some time later, poll for that fence sync object to be complete (or just wait for it altogether) via glClientWaitSync.

If glClientWaitSync returns that the commands before the fence are complete, you can now read from the buffer without an expensive CPU/GPU sync. (If the driver is particularly stupid and didn't already move the buffer contents into mappable addresses, in spite of your usage hints on the PBO, you can use another thread to perform the map. glGetBufferSubData can therefore be cheaper, as the data doesn't need to be in a mappable range.)
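
The fence-based flow described above could look roughly like the following. This is a sketch under stated assumptions, not a definitive implementation: it assumes a current OpenGL 3.2+ context, a loader header such as GLEW, and a PBO that was already created and sized elsewhere. `PendingReadback`, `startReadback` and `tryFinishReadback` are hypothetical names introduced for illustration.

```cpp
#include <GL/glew.h>  // assumption: any loader exposing OpenGL 3.2+ would do
#include <cstddef>
#include <cstring>

// Hypothetical state for one in-flight readback.
struct PendingReadback {
    GLuint pbo = 0;          // pixel pack buffer, created and sized elsewhere
    GLsync fence = nullptr;  // signalled once the GPU has finished the copy
};

// Queue the copy into the PBO and place a fence right after it.
// Nothing is mapped here, so the call does not have to wait for the GPU.
void startReadback(PendingReadback& r, GLsizei width, GLsizei height) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, r.pbo);
    // With a PBO bound, the last argument is a byte offset into the buffer.
    glReadPixels(0, 0, width, height, GL_BGR, GL_UNSIGNED_BYTE, nullptr);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    r.fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

// Poll the fence with a zero timeout. Returns true once the pixels have been
// copied into dst; returns false if the data is not ready yet.
// GL_WAIT_FAILED is treated like "not ready" in this sketch; real code
// should check for errors separately.
bool tryFinishReadback(PendingReadback& r, unsigned char* dst, std::size_t byteCount) {
    if (r.fence == nullptr) return false;
    const GLenum status = glClientWaitSync(r.fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
    if (status != GL_ALREADY_SIGNALED && status != GL_CONDITION_SATISFIED) return false;
    glDeleteSync(r.fence);
    r.fence = nullptr;

    glBindBuffer(GL_PIXEL_PACK_BUFFER, r.pbo);
    const void* src = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0,
                                       static_cast<GLsizeiptr>(byteCount), GL_MAP_READ_BIT);
    bool ok = (src != nullptr);
    if (ok) {
        std::memcpy(dst, src, byteCount);
        ok = (glUnmapBuffer(GL_PIXEL_PACK_BUFFER) == GL_TRUE);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    return ok;
}
```

The idea is to call `startReadback` right after rendering and to call `tryFinishReadback` later, for instance once per frame, with `dst` pointing at the cv::Mat data and `byteCount` set to 1920 * 1080 * 3 for the image size used in the question.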


If you need to do this on a frame-by-frame basis, you'll notice that it's very likely that you'll need more than one PBO, that is, have a small pool of them. This is because at the next frame the readback of the previous frame's data is not complete yet and the corresponding fence is not signalled yet. (Yes, GPUs are massively pipelined these days, and they will be some frames behind your submission queue.)
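
The bookkeeping for such a pool can be sketched independently of OpenGL. The class below is a hypothetical round-robin ring (C++17): each slot would own one PBO and the fence issued after its glReadPixels. Only the slot tracking is shown here, and the places where the GL calls would go are marked in comments. `ReadbackRing` and its member names are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <optional>

// Hypothetical ring of N readback slots. Each slot stands for one PBO plus
// the fence issued after its glReadPixels; only the bookkeeping is shown.
template <std::size_t N>
class ReadbackRing {
public:
    // Pick the slot for this frame's readback. Returns std::nullopt when all
    // slots are still in flight; the caller can then skip the frame or wait
    // on the oldest fence.
    std::optional<std::size_t> acquire() {
        if (inFlight_ == N) return std::nullopt;
        const std::size_t slot = (oldest_ + inFlight_) % N;
        ++inFlight_;
        // Real code: bind this slot's PBO, glReadPixels, glFenceSync.
        return slot;
    }

    // Index of the oldest in-flight slot, i.e. the one submitted first.
    std::optional<std::size_t> oldest() const {
        if (inFlight_ == 0) return std::nullopt;
        return oldest_;
    }

    // Call after the oldest slot's fence has signalled and its data was copied.
    void releaseOldest() {
        if (inFlight_ == 0) return;
        oldest_ = (oldest_ + 1) % N;
        --inFlight_;
    }

private:
    std::size_t oldest_ = 0;    // index of the oldest in-flight slot
    std::size_t inFlight_ = 0;  // number of slots currently in flight
};
```

A pool of two or three slots is a plausible starting point, though the right size depends on how many frames the driver lets the GPU run behind.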
