
Is computing an integral image on the GPU really faster than on the CPU?

I'm new to GPU computing, so this may be a really naive question.
I did a few look-ups, and it seems that computing the integral image on the GPU is a pretty good idea.
However, when I really dug into it, I started wondering whether it is actually faster than the CPU, especially for big images. So I just want to know your thoughts about it, and some explanation if the GPU really is faster.

So, assuming we have an MxN image, computing the integral image on the CPU needs roughly 3xMxN additions, which is O(MxN).
On the GPU, following the code provided by the "OpenGL Super Bible" 6th edition, it needs some KxMxNxlog2(N) + KxMxNxlog2(M) operations, where K is a constant covering the bit-shifts, multiplications, additions and so on done at each step.
The GPU can work on, say, 32 pixels in parallel, depending on the device, but it's still O(MxNxlog2(M)).
I think that even at the common resolution of 640x480, the CPU would still be faster.
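
For reference, here is the kind of CPU code I have in mind (a rough sketch of my own, not from the book; single-channel float image, row-major):

#include <vector>

// Minimal CPU integral image: the classic recurrence
//   sum(x,y) = img(x,y) + sum(x-1,y) + sum(x,y-1) - sum(x-1,y-1)
// costs about 3 additions/subtractions per pixel, i.e. the ~3xMxN
// operations mentioned above -- O(MxN) overall.
std::vector<double> integralImage(const std::vector<float>& img, int W, int H)
{
    std::vector<double> sum(static_cast<size_t>(W) * H, 0.0);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
        {
            double s = img[y * W + x];
            if (x > 0)          s += sum[y * W + (x - 1)];
            if (y > 0)          s += sum[(y - 1) * W + x];
            if (x > 0 && y > 0) s -= sum[(y - 1) * W + (x - 1)];
            sum[y * W + x] = s;
        }
    return sum;
}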

Am I wrong here?
[Edit] This is the shader code straight from the book. The idea is to use 2 passes: calculate the integral of the rows, then calculate the integral of the columns of the result from pass 1. The shader code below performs one pass.

#version 430 core

// One workgroup processes one row of the image; each of the 1024
// invocations owns two adjacent pixels, so a row can be up to 2048 wide.
layout (local_size_x = 1024) in;

shared float shared_data[gl_WorkGroupSize.x * 2];

layout (binding = 0, r32f) readonly uniform image2D input_image;
layout (binding = 1, r32f) writeonly uniform image2D output_image;

void main(void)
{
    uint id = gl_LocalInvocationID.x;
    uint rd_id;
    uint wr_id;
    uint mask;
    // This invocation loads pixels (2*id, row) and (2*id + 1, row),
    // where the row index comes from the workgroup ID.
    ivec2 P = ivec2(id * 2, gl_WorkGroupID.x);
    const uint steps = uint(log2(gl_WorkGroupSize.x)) + 1;
    uint step = 0;

    shared_data[id * 2] = imageLoad(input_image, P).r;
    shared_data[id * 2 + 1] = imageLoad(input_image, P + ivec2(1, 0)).r;
    barrier();
    memoryBarrierShared();

    // Inclusive prefix sum over the row in log2(row length) steps:
    // at each step, the last element of an already-finished 2^step
    // block is added into the elements of the following block.
    for (step = 0; step < steps; step++)
    {
        mask = (1 << step) - 1;
        rd_id = ((id >> step) << (step + 1)) + mask;
        wr_id = rd_id + 1 + (id & mask);
        shared_data[wr_id] += shared_data[rd_id];
        barrier();
        memoryBarrierShared();
    }

    // Store the scanned row transposed (P.yx), so running the same
    // shader again on the output scans the original columns and
    // transposes the image back.
    imageStore(output_image, P.yx, vec4(shared_data[id * 2]));
    imageStore(output_image, P.yx + ivec2(0, 1), vec4(shared_data[id * 2 + 1]));
}
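
For context, here is how I understand the two passes would be dispatched from the host (my own sketch, not from the book; prog is the linked compute program and texIn, texTmp, texOut are GL_R32F textures created elsewhere; a square 2048x2048 image is assumed so one 1024-invocation workgroup covers a full row at two pixels each):

// Hypothetical host-side dispatch of the two prefix-sum passes.
const GLuint kSize = 2048;              // image width == height
glUseProgram(prog);

// Pass 1: prefix-sum each row; the shader writes the result transposed.
glBindImageTexture(0, texIn,  0, GL_FALSE, 0, GL_READ_ONLY,  GL_R32F);
glBindImageTexture(1, texTmp, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_R32F);
glDispatchCompute(kSize, 1, 1);         // one workgroup per row
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

// Pass 2: the same shader on the transposed result sums the original
// columns and transposes back, yielding the full integral image.
glBindImageTexture(0, texTmp, 0, GL_FALSE, 0, GL_READ_ONLY,  GL_R32F);
glBindImageTexture(1, texOut, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_R32F);
glDispatchCompute(kSize, 1, 1);
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);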

What do you mean by integral image?

My assumption is summing K images of the same resolution MxN together. In such a case it is O(K*M*N) on both CPU and GPU, but the constants can be better on the GPU, as graphics memory access is much faster there than on the CPU side. There are also usually more GPU cores than CPU cores, favoring the GPU for this.

If K is too big to fit into the U available GPU texture units at once, then you need to use multiple passes, so O(K*M*N*log(K)/log(U)) for K > U ... where the CPU might be faster in some cases. But as the previous comment suggested, without a test you can only guess. You also need to take into account that there are things like bindless texturing and texture arrays which allow doing this in a single pass (but I am unsure if there are any performance costs for that).
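
For that reading, the CPU baseline is just a straight accumulation loop (a sketch; names are illustrative):

#include <vector>

// Sum K images of identical resolution MxN: O(K*M*N) additions,
// trivially parallel per pixel on either CPU or GPU.
std::vector<float> sumImages(const std::vector<std::vector<float>>& imgs)
{
    std::vector<float> acc(imgs[0].size(), 0.0f);
    for (const auto& img : imgs)               // K images
        for (size_t i = 0; i < img.size(); ++i)
            acc[i] += img[i];                  // one add per pixel per image
    return acc;
}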

[Edit1] after clarifying what you actually want to do

First, let's assume for simplicity that we have a square input image of NxN pixels. We can then handle the H-lines and V-lines separately (similar to a 2D FFT) to ease up the process. On top of that, we can subdivide each line into regions of M pixels, so:

N = M*K

where N is the image resolution, M is the region resolution and K is the number of regions.

  1. 1st pass

    Render a line for each region, so we get K lines of size M, using a fragment shader that computes the integral image of each region only, outputting to some texture. This is T(0.5*K*M^2*N). The whole thing can be done in a fragment shader invoked by a single QUAD covering the screen ...

  2. 2nd pass

    Convert the region integrals into full-image integrals. So render K lines again, and in the fragment add all the last pixels of every previous region. This is T(0.5*K^3*N). This too can be done in a fragment shader invoked by a single QUAD covering the screen ...

  3. Do #1 and #2 on the result in the other axis direction.

This whole thing converts to

T(2*N*(0.5*K*M^2+0.5*K^3))
T(N*(K*M^2+K^3))
O(N*(K*M^2+K^3))
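
To make the two passes concrete, here is a single-row CPU mock-up of the same idea (my own sketch, not actual shader code; pass 2 reads the pass-1 result and writes a fresh buffer, so every pixel is independent, just like a fragment would be):

#include <vector>

// Pass 1: independent prefix sum inside each of the K regions of size M
// (what the first QUAD pass computes for one line).
void pass1(std::vector<float>& row, int M, int K)
{
    for (int k = 0; k < K; ++k)
        for (int x = 1; x < M; ++x)
            row[k * M + x] += row[k * M + x - 1];
}

// Pass 2: for every pixel, add the last pixels of all previous regions.
// The per-pixel loop over previous regions is where the higher-order
// K term in T() comes from.
std::vector<float> pass2(const std::vector<float>& p1, int M, int K)
{
    std::vector<float> out(p1);
    for (int k = 1; k < K; ++k)
        for (int x = 0; x < M; ++x)
        {
            float offset = 0.0f;
            for (int j = 0; j < k; ++j)
                offset += p1[j * M + (M - 1)]; // last pixel of region j
            out[k * M + x] = p1[k * M + x] + offset;
        }
    return out;
}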

Now you can tweak M to maximize performance on your setup ... If I rewrite the whole thing in terms of M and N, then:

T(N*((N/M)*M^2+(N/M)^3))
T(N*(N*M+(N/M)^3))

So you should minimize this term; I would try values around:

N*M = (N/M)^3
N*M = N^3/M^3
M^4 = N^2
M^2 = N
M = sqrt(N) = N^0.5
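
For example, at N = 1024 this gives M = sqrt(1024) = 32 pixels per region and K = N/M = 32 regions.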

So the whole thing converts to:

T(N*(N*M+(N/M)^3))
T(N*(N*N^0.5+(N/N^0.5)^3))
T(N*(N^1.5+N^1.5))
T(2*N^2.5)
O(N^2.5)

That is faster than the naive O(N^4). But you're right that the CPU needs fewer operations, O(N^2), for this, and does not require copying data or multiple passes, so you should find out the threshold resolution on the specific HW for your task and choose based on the measurements. PS: I hope I did not make a silly mistake somewhere in the computations. Also, if you process the H and V lines separately on the CPU (computing each line integral naively), the CPU-side complexity will be O(N^3), and using this region approach even O(N^2.5), without the need for 2 passes per axis.

Take a look at this similar QA:

I think it is a good starting point.
