
Metal much slower compared to OpenGL while rendering small textures on a large texture

I am trying to migrate my projects from OpenGL to Metal on iOS. But I seem to have hit a performance wall. The task is simple...

I have a large texture (more than 3000x3000 pixels), on which I need to draw several hundred small textures (say 124x124) on each touchesMoved event, with a particular blending function enabled. It is basically like a paint brush. And then the large texture is displayed. That is roughly the task.

On OpenGL it runs pretty fast. I get around 60fps. When I port the same code to Metal, I could manage to get only 15fps.

I have created two bare-minimum sample projects to demonstrate the problem. Here are the projects (both OpenGL and Metal)...

https://drive.google.com/file/d/12MPt1nMzE2UL_s4oXEUoTCXYiTz42r4b/view?usp=sharing

This is roughly what I do in OpenGL...

    - (void) renderBrush:(GLuint)brush on:(GLuint)fbo ofSize:(CGSize)size at:(CGPoint)point {
        GLfloat brushCoordinates[] = {
            0.0f, 0.0f,
            1.0f, 0.0f,
            0.0f, 1.0f,
            1.0f, 1.0f,
        };

        GLfloat imageVertices[] = {
            -1.0f, -1.0f,
             1.0f, -1.0f,
            -1.0f,  1.0f,
             1.0f,  1.0f,
        };

        int brushSize = 124;

        // Quad around the touch point, normalized to the target size
        CGRect rect = CGRectMake(point.x - brushSize/2, point.y - brushSize/2, brushSize, brushSize);

        rect.origin.x /= size.width;
        rect.origin.y /= size.height;
        rect.size.width /= size.width;
        rect.size.height /= size.height;

        [self convertImageVertices:imageVertices toProjectionRect:rect onImageOfSize:size];

        // Remember the currently bound FBO so it can be restored afterwards
        GLint currentFBO;
        glGetIntegerv(GL_FRAMEBUFFER_BINDING, &currentFBO);

        [_Program use];

        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glViewport(0, 0, (int)size.width, (int)size.height);

        glActiveTexture(GL_TEXTURE2);
        glBindTexture(GL_TEXTURE_2D, brush);
        glUniform1i(brushTextureLocation, 2);

        glVertexAttribPointer(positionLocation, 2, GL_FLOAT, 0, 0, imageVertices);
        glVertexAttribPointer(brushCoordinateLocation, 2, GL_FLOAT, 0, 0, brushCoordinates);

        glEnable(GL_BLEND);
        glBlendEquation(GL_FUNC_ADD);
        glBlendFuncSeparate(GL_ONE, GL_ZERO, GL_ONE, GL_ONE);

        glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);

        glDisable(GL_BLEND);

        glActiveTexture(GL_TEXTURE2);
        glBindTexture(GL_TEXTURE_2D, 0);

        glBindFramebuffer(GL_FRAMEBUFFER, currentFBO);
    }

I run this code in a loop (about 200-500 iterations) per touch event. It runs pretty fast.

And this is how I have ported the code to Metal...

    - (void) renderBrush:(id<MTLTexture>)brush onTarget:(id<MTLTexture>)target at:(CGPoint)point withCommandBuffer:(id<MTLCommandBuffer>)commandBuffer {

        int brushSize = 124;

        CGRect rect = CGRectMake(point.x - brushSize/2, point.y - brushSize/2, brushSize, brushSize);

        rect.origin.x /= target.width;
        rect.origin.y /= target.height;
        rect.size.width /= target.width;
        rect.size.height /= target.height;

        Float32 imageVertices[8];
        // Calculate the vertices (basically the rectangle that we need to draw) on the target texture.
        // We are not drawing on the entire target texture, only on a square around the point.
        [self composeImageVertices:imageVertices toProjectionRect:rect onImageOfSize:CGSizeMake(target.width, target.height)];

        // We use a different vertexBuffer per pass, because this runs in a loop and subsequent
        // calls would overwrite the values. Other buffers also get overwritten, but that is OK
        // for now; we only need to demonstrate the performance.
        id<MTLBuffer> vertexBuffer = [_vertexArray lastObject];

        memcpy([vertexBuffer contents], imageVertices, 8 * sizeof(Float32));

        // A new render command encoder for every single quad
        id<MTLRenderCommandEncoder> commandEncoder = [commandBuffer renderCommandEncoderWithDescriptor:mRenderPassDescriptor];
        commandEncoder.label = @"DrawCE";

        [commandEncoder setRenderPipelineState:mPipelineState];

        [commandEncoder setVertexBuffer:vertexBuffer offset:0 atIndex:0];
        [commandEncoder setVertexBuffer:mBrushTextureBuffer offset:0 atIndex:1];

        [commandEncoder setFragmentTexture:brush atIndex:0];
        [commandEncoder setFragmentSamplerState:mSampleState atIndex:0];

        [commandEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4];
        [commandEncoder endEncoding];
    }

And then I run this code in a loop with a single MTLCommandBuffer per touch event, like this...

    id<MTLCommandBuffer> commandBuffer = [MetalContext.defaultContext.commandQueue commandBuffer];
    commandBuffer.label = @"DrawCB";

    dispatch_semaphore_wait(_inFlightSemaphore, DISPATCH_TIME_FOREVER);

    mRenderPassDescriptor.colorAttachments[0].texture = target;

    __block dispatch_semaphore_t block_sema = _inFlightSemaphore;
    [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
        dispatch_semaphore_signal(block_sema);
    }];

    _vertexArray = [[NSMutableArray alloc] init];
    for (int i = 0; i < strokes; i++) {
        // A new vertex buffer is allocated for every quad, on every touch event
        id<MTLBuffer> vertexBuffer = [MetalContext.defaultContext.device newBufferWithLength:8 * sizeof(Float32) options:0];
        [_vertexArray addObject:vertexBuffer];

        id<MTLTexture> brush = [_brushes objectAtIndex:rand() % _brushes.count];
        [self renderBrush:brush onTarget:target at:CGPointMake(x, y) withCommandBuffer:commandBuffer];
        x += deltaX;
        y += deltaY;
    }

    [commandBuffer commit];

In the sample code which I have attached, I have replaced the touch events with a timer loop to keep things simple.

On an iPhone 7 Plus, I get 60fps with OpenGL and 15fps with Metal. Maybe I am doing something horribly wrong here?

Remove all redundancy:

  • Don't create buffers at render time. Allocate sufficient buffers during initialization (see the first sketch after this list).
  • Don't create a command encoder for every quad.
  • Use one big vertex buffer with different (properly aligned) offsets for each quad. Use -setVertexBufferOffset:atIndex: to set just the offset as necessary, without changing the buffer (see the second sketch after this list).
  • composeImageVertices:... can write directly into the vertex buffer with an appropriate cast, avoiding a memcpy.
  • Depending on what composeImageVertices:... actually does, and if deltaX and deltaY are constants, you may be able to set up the vertex buffer once, ever. The vertex shader can transform the vertices as necessary. You would pass in the appropriate data as uniforms (either the destination point and render-target size, or even a transform matrix).
  • Assuming they're the same every time, don't set mPipelineState, mBrushTextureBuffer, and mSampleState every time.
  • If any quads share the same brush texture, group them together and issue one draw command to draw them all. This may require switching to triangle primitives instead of triangle-strip primitives. However, if you do an indexed draw, you can use the primitive restart sentinel to draw multiple triangle strips in one draw command.
  • You can even do multiple brushes in one draw command if the count doesn't exceed the number of textures allowed (31). Pass all of the brush textures to the fragment shader. It can receive them as a texture array. The vertex data would include the brush index, the vertex shader would pass that forward, and the fragment shader would use it to look up the texture to sample from the array.
  • You could use instanced drawing to draw everything in a single command. Draw strokes instances of a single quad. In the vertex shader, transform the position based on the instance ID. You would have to pass deltaX and deltaY in as uniform data. The brush indexes can be in a single buffer that's passed in, too, and the shader can look up the brush index in it by the instance ID (see the third sketch after this list).
  • Have you considered using point primitives instead of quads? That would reduce the number of vertices and give Metal information that it can use to optimize rasterization.
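
To make the buffer points concrete, here is a minimal one-time setup sketch. It is not tested against your project; _strokeVertexBuffer, kMaxStrokes and kQuadStride are placeholder names, and the 256-byte slot size is simply a conservative choice for the per-quad offset alignment.

    // One-time setup (e.g. where the pipeline state is created), not per touch event.
    static const NSUInteger kMaxStrokes = 512;   // upper bound on strokes per touch event
    static const NSUInteger kQuadStride = 256;   // bytes per quad slot; conservatively aligned for vertex-buffer offsets

    _strokeVertexBuffer =
        [MetalContext.defaultContext.device newBufferWithLength:kMaxStrokes * kQuadStride
                                                         options:MTLResourceStorageModeShared];

With a shared-storage buffer like this, composeImageVertices:... can write straight into the mapped contents pointer at each quad's offset, so the intermediate stack array and the memcpy go away.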
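Building on that, here is a rough, untested sketch of how a whole touch event could be encoded with a single render command encoder, the unchanging state set once, and each quad addressed only by a buffer offset. mPipelineState, mSampleState, mBrushTextureBuffer, mRenderPassDescriptor, composeImageVertices:..., _brushes, _inFlightSemaphore and the loop variables are taken from your code; the rest comes from the setup sketch above.

    dispatch_semaphore_wait(_inFlightSemaphore, DISPATCH_TIME_FOREVER);

    id<MTLCommandBuffer> commandBuffer = [MetalContext.defaultContext.commandQueue commandBuffer];
    mRenderPassDescriptor.colorAttachments[0].texture = target;

    // One encoder for the whole stroke batch instead of one per quad.
    id<MTLRenderCommandEncoder> encoder =
        [commandBuffer renderCommandEncoderWithDescriptor:mRenderPassDescriptor];

    // State that never changes between quads is set exactly once.
    [encoder setRenderPipelineState:mPipelineState];
    [encoder setVertexBuffer:_strokeVertexBuffer offset:0 atIndex:0];
    [encoder setVertexBuffer:mBrushTextureBuffer offset:0 atIndex:1];
    [encoder setFragmentSamplerState:mSampleState atIndex:0];

    uint8_t *vertexBase = (uint8_t *)_strokeVertexBuffer.contents;   // write quads in place, no memcpy
    int brushSize = 124;

    for (int i = 0; i < strokes && i < (int)kMaxStrokes; i++) {
        NSUInteger offset = i * kQuadStride;

        CGRect rect = CGRectMake(x - brushSize/2, y - brushSize/2, brushSize, brushSize);
        rect.origin.x    /= target.width;
        rect.origin.y    /= target.height;
        rect.size.width  /= target.width;
        rect.size.height /= target.height;
        [self composeImageVertices:(Float32 *)(vertexBase + offset)
                  toProjectionRect:rect
                     onImageOfSize:CGSizeMake(target.width, target.height)];

        // Only the offset and the brush texture change per quad.
        [encoder setVertexBufferOffset:offset atIndex:0];
        [encoder setFragmentTexture:[_brushes objectAtIndex:rand() % _brushes.count] atIndex:0];
        [encoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4];

        x += deltaX;
        y += deltaY;
    }

    [encoder endEncoding];

    __block dispatch_semaphore_t block_sema = _inFlightSemaphore;
    [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
        dispatch_semaphore_signal(block_sema);
    }];
    [commandBuffer commit];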
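Finally, a host-side sketch of the instanced / texture-array variant, using the same encoder as above. It assumes a matching vertex shader that reads the per-instance record via its instance ID, positions a unit quad at center, and forwards brushIndex so the fragment shader can sample from an array of textures. StrokeInstance, _instanceBuffer and _unitQuadBuffer are made-up names; both buffers would be allocated once at init.

    #import <simd/simd.h>

    // Hypothetical per-instance record; the vertex shader indexes the buffer
    // by instance ID, offsets a unit quad by `center`, and forwards `brushIndex`.
    typedef struct {
        vector_float2 center;      // stroke centre in pixels on the target texture
        uint32_t      brushIndex;  // which bound brush texture to sample
        uint32_t      _pad;        // keep the struct 16 bytes
    } StrokeInstance;

    // Per touch event: fill one record per stroke and issue a single instanced draw.
    StrokeInstance *instances = (StrokeInstance *)_instanceBuffer.contents;
    for (int i = 0; i < strokes; i++) {
        instances[i].center     = (vector_float2){ (float)x, (float)y };
        instances[i].brushIndex = (uint32_t)(rand() % _brushes.count);
        x += deltaX;
        y += deltaY;
    }

    [encoder setVertexBuffer:_unitQuadBuffer offset:0 atIndex:0];   // 4 vertices of a unit quad
    [encoder setVertexBuffer:_instanceBuffer offset:0 atIndex:1];   // per-instance data above

    // Bind all brush textures once (up to the 31-texture limit); the fragment
    // shader receives them as an array and picks one by the forwarded brushIndex.
    id<MTLTexture> brushTextures[31];
    NSUInteger brushCount = MIN(_brushes.count, (NSUInteger)31);
    [_brushes getObjects:brushTextures range:NSMakeRange(0, brushCount)];
    [encoder setFragmentTextures:brushTextures withRange:NSMakeRange(0, brushCount)];

    // Every stroke in one draw call; the GPU repeats the quad `strokes` times.
    [encoder drawPrimitives:MTLPrimitiveTypeTriangleStrip
                vertexStart:0
                vertexCount:4
              instanceCount:strokes];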
