
Rendering Quads Performance with Metal

I'm trying to render a large number of very small 2D quads as fast as possible on an Apple A7 GPU using the Metal API. Going by that GPU's published triangle throughput numbers (e.g. here), and by Apple quoting >1M triangles on screen during their keynote demo, I'd expect to be able to render something like 500,000 such quads per frame at 60fps. Perhaps a bit less, given that all of them are visible (on screen, not hidden by the z-buffer) and tiny (tricky for the rasterizer), so this likely isn't a use case the GPU is especially well optimized for. And perhaps that Apple demo was running at 30fps, so let's say ~200,000 should be doable. Certainly 100,000... right?

However, in my test app the maximum is just ~20,000 -- any more than that and the framerate drops below 60 on an iPad Air. With 100,000 quads it runs at 14fps, i.e. a throughput of 2.8M triangles/sec (compare that to the 68.1M on-screen triangles quoted in the AnandTech article!).

Even if I shrink the quads to a single pixel, with a trivial fragment shader, performance doesn't improve. So we can assume this is vertex bound, and the GPU report in Xcode agrees ("Tiler" is at 100%). The vertex shader is trivial as well, doing nothing but a little scaling and translation math, so I'm assuming the bottleneck is some fixed-function stage...?

Just for some more background info: I'm rendering all the geometry with a single instanced draw call, one quad per instance, i.e. 4 vertices per instance. The quad positions are read from a separate buffer that's indexed by instance ID in the vertex shader. I've tried a few other methods as well (non-instanced with all vertices pre-transformed, instanced+indexed, etc.), but that didn't help. There are no complex vertex attributes, buffer/surface formats, or anything else I can think of that seems likely to hit a slow path in the driver/GPU (though I can't be sure, of course). Blending is off. Pretty much everything else is in the default state (viewport, scissor, z-test, culling, etc.).
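For reference, the encoding path described above looks roughly like this on the Swift side (a minimal sketch; the buffer and pipeline names are my own illustrations, not from my actual code):

```swift
// Sketch of the instanced draw described above (names are illustrative).
// vertexBuffer: the 4 unit-quad corners (one float2 each)
// quadBuffer:   one QuadState (position) per instance
// parmsBuffer:  camera/projection parameters
encoder.setRenderPipelineState(pipeline)
encoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)
encoder.setVertexBuffer(quadBuffer,   offset: 0, index: 1)
encoder.setVertexBuffer(parmsBuffer,  offset: 0, index: 2)
// One quad per instance: 4 vertices as a triangle strip, quadCount instances.
encoder.drawPrimitives(type: .triangleStrip,
                       vertexStart: 0,
                       vertexCount: 4,
                       instanceCount: quadCount)
```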

The application is written in Swift, though hopefully that doesn't matter ;)

What I'm trying to understand is whether the performance I'm seeing is expected when rendering quads like this (as opposed to a "proper" 3D scene), or whether more advanced techniques are needed to get anywhere close to the advertised triangle throughputs. What do people think is the likely limiting bottleneck here?

Also, if anyone knows any reason why this might be faster in OpenGL than in Metal (I haven't tried, and can't think of any reason), then I'd love to hear it as well.

Thanks

Edit: adding shader code.

vertex float4 vertex_shader(
        const constant float2* vertex_array [[ buffer(0) ]],
        const device QuadState* quads [[ buffer(1) ]],
        constant const Parms& parms [[ buffer(2) ]],
        unsigned int vid [[ vertex_id ]],
        unsigned int iid [[ instance_id ]] )
{
    float2 v = vertex_array[vid]*0.5f;

    v += quads[iid].position;

    // ortho cam and projection transform
    v += parms.cam.position;
    v *= parms.cam.zoom * parms.proj.scaling;

    return float4(v, 0, 1.0);
}


fragment half4 fragment_shader()
{
    return half4(0.773,0.439,0.278,0.4);
}

Without seeing your Swift/Objective-C code I cannot be sure, but my guess is that the per-instance overhead of instancing is hurting you. Instancing is useful when each instance is a model with hundreds of triangles in it, not one with just two.

Try creating a vertex buffer with 1000 quads in it and see if the performance increases.
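To try that, you could expand the quads into one flat vertex buffer on the CPU and draw the whole batch with a single non-instanced call. A rough sketch of the expansion, assuming a simple `Quad` struct of my own invention (not code from the question):

```swift
import simd

struct Quad {
    var position: SIMD2<Float>
    var halfSize: Float
}

// Expand each quad into two triangles (6 vertices), so the whole batch
// can be drawn with one plain drawPrimitives(type: .triangle, ...) call
// instead of one instance per quad.
func buildVertices(for quads: [Quad]) -> [SIMD2<Float>] {
    var verts: [SIMD2<Float>] = []
    verts.reserveCapacity(quads.count * 6)
    for q in quads {
        let h = q.halfSize
        let bl = q.position + SIMD2(-h, -h)   // bottom-left
        let br = q.position + SIMD2( h, -h)   // bottom-right
        let tl = q.position + SIMD2(-h,  h)   // top-left
        let tr = q.position + SIMD2( h,  h)   // top-right
        verts += [bl, br, tl,  tl, br, tr]    // two CCW triangles
    }
    return verts
}
```

The vertex shader then no longer needs the `instance_id`; each vertex already carries its final quad-local position.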
