Is computing integral image on GPU really faster than on CPU?

I'm new to GPU computing, so this may be a really naive question.
I did a few look-ups, and it seems that computing an integral image on the GPU is a pretty good idea.
However, when I really dig into it, I wonder whether it is actually faster than the CPU, especially for big images. So I'd like to hear your thoughts, and an explanation if the GPU really is faster.

So, assuming we have an MxN image, computing the integral image on the CPU needs roughly 3xMxN additions, which is O(MxN).
On the GPU, following the code provided by the "OpenGL SuperBible", 6th edition, it needs about KxMxNxlog2(N) + KxMxNxlog2(M) operations, where K is the number of operations per step (a lot of bit-shifting, multiplication, addition...).
The GPU can work in parallel on, say, 32 pixels at a time, depending on the device, but that is still O(MxNxlog2(M)).
I think that even at the common resolution of 640x480, the CPU is still faster.
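To make the two operation counts above concrete, here is a minimal Python sketch. It is only an illustration: `integral_image`, `cpu_additions` and `scan_additions` are hypothetical helper names, the sizes are assumed to be powers of two, and the scan count covers only the additions of a log-step parallel scan (one per invocation per step), not the extra factor K for shifts, loads and barriers.

```python
def integral_image(img):
    """Sequential integral image using a zero-padded table.

    Each pixel costs 3 additions/subtractions:
    I(y, x) = p(y, x) + I(y-1, x) + I(y, x-1) - I(y-1, x-1)
    """
    rows, cols = len(img), len(img[0])
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for y in range(rows):
        for x in range(cols):
            table[y + 1][x + 1] = (img[y][x] + table[y][x + 1]
                                   + table[y + 1][x] - table[y][x])
    # Drop the padding row and column.
    return [row[1:] for row in table[1:]]


def cpu_additions(rows, cols):
    """Additions/subtractions used by integral_image: 3 per pixel."""
    return 3 * rows * cols


def scan_additions(rows, cols):
    """Total additions of a two-pass log-step scan (assumed model).

    Assumes power-of-two sizes, length/2 invocations per line and
    log2(length) steps, with one addition per invocation per step.
    """
    def one_pass(lines, length):
        log2_length = length.bit_length() - 1  # exact for powers of two
        return lines * (length // 2) * log2_length

    return one_pass(rows, cols) + one_pass(cols, rows)


if __name__ == "__main__":
    print(integral_image([[1, 2], [3, 4]]))
    print(cpu_additions(512, 512), scan_additions(512, 512))
```

Note that this compares total work only. The scan additions are spread over many invocations running in parallel, so wall-clock time depends on how many of them the device actually executes at once.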

Am I wrong here?
[Edit] This is the shader code straight from the book. The idea is to use 2 passes: calculate the integral of the rows, then calculate the integral of the columns of the result from pass 1. This shader code is for 1 pass.

#version 430 core
// One work group scans one image row; each invocation handles two pixels.
layout (local_size_x = 1024) in;
shared float shared_data[gl_WorkGroupSize.x * 2];
layout (binding = 0, r32f) readonly uniform image2D input_image;
layout (binding = 1, r32f) writeonly uniform image2D output_image;
void main(void)
{
    uint id = gl_LocalInvocationID.x;
    uint rd_id;
    uint wr_id;
    uint mask;
    // Pixel pair (2*id, 2*id + 1) in the row selected by the work group ID.
    ivec2 P = ivec2(id * 2, gl_WorkGroupID.x);
    const uint steps = uint(log2(gl_WorkGroupSize.x)) + 1;
    uint step = 0;
    // Load the pixel pair into shared memory.
    shared_data[id * 2] = imageLoad(input_image, P).r;
    shared_data[id * 2 + 1] = imageLoad(input_image, P + ivec2(1, 0)).r;
    barrier();
    memoryBarrierShared();
    // Log-step inclusive prefix sum: at each step, the last element of the
    // left half of every block is added to each element of the right half.
    for (step = 0; step < steps; step++)
    {
        mask = (1 << step) - 1;
        rd_id = ((id >> step) << (step + 1)) + mask;
        wr_id = rd_id + 1 + (id & mask);
        shared_data[wr_id] += shared_data[rd_id];
        barrier();
        memoryBarrierShared();
    }
    // Store transposed (P.yx), so a second run of the shader scans the columns.
    imageStore(output_image, P.yx, vec4(shared_data[id * 2]));
    imageStore(output_image, P.yx + ivec2(0, 1), vec4(shared_data[id * 2 + 1]));
}
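The index arithmetic in the loop can be checked on the CPU. The following Python sketch simulates one work group sequentially; it is only an assumption-laden model of the shader, not the book's code. `workgroup_scan`, `scan_pass` and `integral_image_two_pass` are hypothetical names, and row lengths must be a power of two (twice the simulated local size).

```python
def workgroup_scan(values):
    """Simulate the shader's scan loop for one row held in 'shared memory'.

    len(values) must be 2 * n, where n (the simulated work group size)
    is a power of two. Returns the inclusive prefix sum of the row.
    """
    n = len(values) // 2
    if n < 1 or len(values) != 2 * n or n & (n - 1):
        raise ValueError("row length must be 2 * (power of two)")
    shared = list(values)
    steps = n.bit_length()  # equals log2(n) + 1 for a power of two
    for step in range(steps):
        mask = (1 << step) - 1
        # Invocations run one after another here. That is safe because,
        # within a step, reads hit the left half of each block and writes
        # hit the right half, so no invocation reads a value written in
        # the same step.
        for tid in range(n):
            rd_id = ((tid >> step) << (step + 1)) + mask
            wr_id = rd_id + 1 + (tid & mask)
            shared[wr_id] += shared[rd_id]
    return shared


def scan_pass(img):
    """One pass: scan every row, then store the result transposed."""
    scanned = [workgroup_scan(row) for row in img]
    return [list(col) for col in zip(*scanned)]


def integral_image_two_pass(img):
    """Two passes: rows first, then columns via the transposed output."""
    return scan_pass(scan_pass(img))


if __name__ == "__main__":
    print(workgroup_scan([1, 2, 3, 4, 5, 6, 7, 8]))
    print(integral_image_two_pass([[1, 2, 3, 4], [5, 6, 7, 8]]))
```

Because each pass writes its result transposed, the second pass transposes it back, so the final image comes out in the original orientation.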

What do you mean by integral image?

My assumption is that it means summing K images of the same resolution MxN together. In that case it is O(KMN) on both CPU and GPU, but the constant factor can be better on the GPU, as gfx memory access is much faster than on the CPU side. There are also usually more GPU cores than CPU cores, which favors the GPU for this.

If K is too big to fit into the GPU texture units U at once, then you need to use multiple passes, so O(KMNlog(K)/log(U)) for K>U ... where the CPU might be faster in some cases. But as a previous comment suggested, without a test you can only guess. You also need to take into account that there are things like bindless texturing and texture arrays, which allow doing this in a single pass (but I am unsure whether there are any performance costs for that).

[Edit1] after clearing up what you actually want to do

First, let us assume for simplicity that we have a square input image of NxN pixels. We can divide the task into H-lines and V-lines separately (similar to a 2D FFT) to ease up this process. On top of that, we can subdivide each line into groups of M pixels. So:

N = M*K

where N is the resolution, M is the region resolution and K is the number of regions.

  1. 1st pass

    Render a line for each group, so we get K lines of size M, using a fragment shader that computes the integral image of each region only and outputs it to some texture. This is T(0.5*K*M^2*N). The whole thing can be done in a fragment shader rendered by a single QUAD covering the screen ...

  2. 2nd pass

    Convert the region integrals to full image integrals. So again render K lines, and in the fragment shader add the last pixels of all previous groups. This is T(0.5*K^3*N). This too can be done in a fragment shader rendered by a single QUAD covering the screen ...

  3. Do #1 and #2 on the result in the other axis direction

This whole thing converts to:

T(2*N*(0.5*K*M^2+0.5*K^3))
T(N*(K*M^2+K^3))
O(N*(K*M^2+K^3))
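For a single line, the two passes described in the list above can be sketched on the CPU as follows. This is a hedged illustration only: each output element is computed independently by gathering its inputs, the way a fragment shader would, and the function names (`pass1_region_sums`, `pass2_add_previous_groups`, `line_integral`) are made up for this example.

```python
def pass1_region_sums(line, m):
    """Pass 1: prefix sum restricted to each group of m pixels.

    Every output pixel gathers the inputs from the start of its own
    group up to itself, like one fragment shader invocation would.
    """
    out = []
    for i in range(len(line)):
        start = (i // m) * m
        out.append(sum(line[start:i + 1]))
    return out


def pass2_add_previous_groups(partial, m):
    """Pass 2: add the last pixel of every previous group to each pixel."""
    out = []
    for i in range(len(partial)):
        group = i // m
        carry = sum(partial[g * m + m - 1] for g in range(group))
        out.append(partial[i] + carry)
    return out


def line_integral(line, m):
    """Full prefix sum of one line of N = M*K pixels via the two passes."""
    if m < 1 or len(line) % m != 0:
        raise ValueError("line length must be a multiple of the group size")
    return pass2_add_previous_groups(pass1_region_sums(line, m), m)


if __name__ == "__main__":
    print(line_integral([1, 2, 3, 4, 5, 6, 7, 8], 4))
```

Running `line_integral` on every row and then on every column of the result gives the full integral image, which corresponds to step 3 of the list.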

Now you can tweak M for maximum performance on your setup ... If I rewrite the whole thing in terms of M and N, then:

T(N*((N/M)*M^2+(N/M)^3))
T(N*(N*M+(N/M)^3))

So you should minimize this term, and I would try values around the point where its two parts are equal:

N*M = (N/M)^3
N*M = N^3/M^3
M^4 = N^2
M^2 = N
M = sqrt(N) = N^0.5
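As a quick numeric check of this choice, the cost model T(N*(N*M+(N/M)^3)) can be evaluated for a few group sizes. This is a small sketch with hypothetical helper names (`model_cost`, `best_group_size`); it evaluates only the model from the text and says nothing about real hardware timings.

```python
def model_cost(n, m):
    """Cost model from the text: T = N * (N*M + (N/M)^3)."""
    k = n / m  # number of groups per line
    return n * (n * m + k ** 3)


def best_group_size(n, candidates):
    """Return the candidate group size M with the lowest model cost."""
    return min(candidates, key=lambda m: model_cost(n, m))


if __name__ == "__main__":
    n = 1024
    powers_of_two = [2 ** i for i in range(1, 10)]  # 2 .. 512
    for m in powers_of_two:
        print(m, model_cost(n, m))
    print("best M:", best_group_size(n, powers_of_two))
```

Equating the two parts is a rule of thumb rather than an exact minimization: setting the derivative of N*M + N^3/M^3 to zero gives M^4 = 3*N^2, which is the same order of magnitude as sqrt(N).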

So the whole thing converts to:

T(N*(N*M+(N/M)^3))
T(N*(N*N^0.5+(N/N^0.5)^3))
T(N^2.5+N^2.5)
O(N^2.5)

Which is faster than the naive O(N^4). But you're right, the CPU has fewer operations to do, O(N^2), for this, and does not require a copy of the data or multiple passes, so you should find the threshold resolution on your specific HW for your task and choose depending on the measurements. PS. I hope I did not make a silly mistake somewhere in the computations. Also, if you do the H and V lines separately on the CPU, then the CPU side complexity will be O(N^3), and using this approach even O(N^2.5), without the need for 2 passes per axis.

Take a look at this similar QA:

I think it is a good starting point.
