Metal much slower compared to OpenGL while rendering small textures on a large texture
I am trying to migrate my projects from OpenGL to Metal on iOS, but I seem to have hit a performance wall. The task is simple: I have a large texture (more than 3000x3000 pixels), onto which I need to draw several hundred small textures (say 124x124) on each touchesMoved event, with a particular blending function enabled. It is basically a paint brush. The large texture is then displayed. That is roughly the task.
With OpenGL it runs pretty fast; I get around 60fps. When I port the same code to Metal, I manage to get only 15fps.

I have created two bare-minimum sample projects (one OpenGL, one Metal) to demonstrate the problem:

https://drive.google.com/file/d/12MPt1nMzE2UL_s4oXEUoTCXYiTz42r4b/view?usp=sharing

This is roughly what I do in OpenGL...
- (void)renderBrush:(GLuint)brush on:(GLuint)fbo ofSize:(CGSize)size at:(CGPoint)point {
    GLfloat brushCoordinates[] = {
        0.0f, 0.0f,
        1.0f, 0.0f,
        0.0f, 1.0f,
        1.0f, 1.0f,
    };
    GLfloat imageVertices[] = {
        -1.0f, -1.0f,
         1.0f, -1.0f,
        -1.0f,  1.0f,
         1.0f,  1.0f,
    };
    int brushSize = 124;
    CGRect rect = CGRectMake(point.x - brushSize/2, point.y - brushSize/2, brushSize, brushSize);
    rect.origin.x /= size.width;
    rect.origin.y /= size.height;
    rect.size.width /= size.width;
    rect.size.height /= size.height;
    [self convertImageVertices:imageVertices toProjectionRect:rect onImageOfSize:size];
    GLint currentFBO;
    glGetIntegerv(GL_FRAMEBUFFER_BINDING, &currentFBO);
    [_Program use];
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glViewport(0, 0, (int)size.width, (int)size.height);
    glActiveTexture(GL_TEXTURE2);
    glBindTexture(GL_TEXTURE_2D, brush);
    glUniform1i(brushTextureLocation, 2);
    glVertexAttribPointer(positionLocation, 2, GL_FLOAT, 0, 0, imageVertices);
    glVertexAttribPointer(brushCoordinateLocation, 2, GL_FLOAT, 0, 0, brushCoordinates);
    glEnable(GL_BLEND);
    glBlendEquation(GL_FUNC_ADD);
    glBlendFuncSeparate(GL_ONE, GL_ZERO, GL_ONE, GL_ONE);
    glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
    glDisable(GL_BLEND);
    glActiveTexture(GL_TEXTURE2);
    glBindTexture(GL_TEXTURE_2D, 0);
    glBindFramebuffer(GL_FRAMEBUFFER, currentFBO);
}
I run this code in a loop (about 200-500 iterations) per touch event, and it runs pretty fast.

And this is how I have ported the code to Metal...
- (void)renderBrush:(id<MTLTexture>)brush onTarget:(id<MTLTexture>)target at:(CGPoint)point withCommandBuffer:(id<MTLCommandBuffer>)commandBuffer {
    int brushSize = 124;
    CGRect rect = CGRectMake(point.x - brushSize/2, point.y - brushSize/2, brushSize, brushSize);
    rect.origin.x /= target.width;
    rect.origin.y /= target.height;
    rect.size.width /= target.width;
    rect.size.height /= target.height;
    Float32 imageVertices[8];
    // Calculate the vertices (the rectangle we need to draw) on the target texture.
    // We are not drawing on the entire target texture, only on a square around the point.
    [self composeImageVertices:imageVertices toProjectionRect:rect onImageOfSize:CGSizeMake(target.width, target.height)];
    // We use a different vertexBuffer per pass, because this runs in a loop and
    // subsequent calls would overwrite the values. Other buffers also get
    // overwritten, but that is OK for now; we only need to demonstrate the performance.
    id<MTLBuffer> vertexBuffer = [_vertexArray lastObject];
    memcpy([vertexBuffer contents], imageVertices, 8 * sizeof(Float32));
    id<MTLRenderCommandEncoder> commandEncoder = [commandBuffer renderCommandEncoderWithDescriptor:mRenderPassDescriptor];
    commandEncoder.label = @"DrawCE";
    [commandEncoder setRenderPipelineState:mPipelineState];
    [commandEncoder setVertexBuffer:vertexBuffer offset:0 atIndex:0];
    [commandEncoder setVertexBuffer:mBrushTextureBuffer offset:0 atIndex:1];
    [commandEncoder setFragmentTexture:brush atIndex:0];
    [commandEncoder setFragmentSamplerState:mSampleState atIndex:0];
    [commandEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4];
    [commandEncoder endEncoding];
}
And then I run this code in a loop, with a single MTLCommandBuffer per touch event, like...
id<MTLCommandBuffer> commandBuffer = [MetalContext.defaultContext.commandQueue commandBuffer];
commandBuffer.label = @"DrawCB";
dispatch_semaphore_wait(_inFlightSemaphore, DISPATCH_TIME_FOREVER);
mRenderPassDescriptor.colorAttachments[0].texture = target;
__block dispatch_semaphore_t block_sema = _inFlightSemaphore;
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
    dispatch_semaphore_signal(block_sema);
}];
_vertexArray = [[NSMutableArray alloc] init];
for (int i = 0; i < strokes; i++) {
    id<MTLBuffer> vertexBuffer = [MetalContext.defaultContext.device newBufferWithLength:8 * sizeof(Float32) options:0];
    [_vertexArray addObject:vertexBuffer];
    id<MTLTexture> brush = [_brushes objectAtIndex:rand() % _brushes.count];
    [self renderBrush:brush onTarget:target at:CGPointMake(x, y) withCommandBuffer:commandBuffer];
    x += deltaX;
    y += deltaY;
}
[commandBuffer commit];
In the attached sample code, I have replaced the touch events with a timer loop to keep things simple.

On an iPhone 7 Plus, I get 60fps with OpenGL and 15fps with Metal. Maybe I am doing something horribly wrong here?
Remove all redundancy:
- Use -setVertexBufferOffset:atIndex: to set just the offset as necessary, without changing the buffer.
- composeImageVertices:... can write directly into the vertex buffer with an appropriate cast, avoiding a memcpy.
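The buffer-offset and direct-write suggestions might look roughly like this (a sketch only; names such as _sharedVertexBuffer, maxStrokes, and targetSize are illustrative, not from the project):

```objectivec
// Allocate one buffer big enough for every stroke, once, outside the per-touch loop.
// (Check Metal's buffer-offset alignment rules for the target platform.)
const NSUInteger strokeStride = 8 * sizeof(Float32);
_sharedVertexBuffer = [device newBufferWithLength:maxStrokes * strokeStride
                                          options:MTLResourceStorageModeShared];

// Per stroke i: write the vertices straight into the buffer -- no temp array, no memcpy.
Float32 *dst = (Float32 *)((uint8_t *)_sharedVertexBuffer.contents + i * strokeStride);
[self composeImageVertices:dst toProjectionRect:rect onImageOfSize:targetSize];

// Bind the buffer once, then only move the offset for subsequent draws.
if (i == 0) {
    [commandEncoder setVertexBuffer:_sharedVertexBuffer offset:0 atIndex:0];
} else {
    [commandEncoder setVertexBufferOffset:i * strokeStride atIndex:0];
}
```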
Depending on what composeImageVertices:... actually does, and if deltaX and deltaY are constants, you may be able to set up the vertex buffer once, ever; the vertex shader can transform the vertices as necessary.

Also, use a single render command encoder for the whole loop rather than creating one per stroke (each encoder begins a new render pass over the large target texture); then there is no need to set mPipelineState, mBrushTextureBuffer, and mSampleState every time.
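With one encoder for the whole touch event, the loop might be restructured roughly like this (a sketch; it reuses the pipeline state, sampler, and brush array from the question's code):

```objectivec
// One render pass (one encoder) per command buffer instead of one per stroke.
// mRenderPassDescriptor's loadAction should be MTLLoadActionLoad so that
// earlier strokes on the target are preserved.
id<MTLRenderCommandEncoder> encoder =
    [commandBuffer renderCommandEncoderWithDescriptor:mRenderPassDescriptor];
[encoder setRenderPipelineState:mPipelineState];            // set once, not per stroke
[encoder setVertexBuffer:mBrushTextureBuffer offset:0 atIndex:1];
[encoder setFragmentSamplerState:mSampleState atIndex:0];
for (int i = 0; i < strokes; i++) {
    // ... update the vertex data / buffer offset for stroke i ...
    [encoder setFragmentTexture:[_brushes objectAtIndex:rand() % _brushes.count] atIndex:0];
    [encoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4];
}
[encoder endEncoding];
```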
Better yet, draw stroke instances of a single quad. In the vertex shader, transform the position based on the instance ID, passing deltaX and deltaY in as uniform data. The brush indexes can be in a single buffer that's passed in, too, and the shader can look up the brush index in it by the instance ID.
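An instanced vertex shader for this might look roughly like the following MSL sketch (the struct layout, buffer indices, and the assumption that stroke positions advance linearly in clip space are all illustrative):

```metal
#include <metal_stdlib>
using namespace metal;

struct StrokeUniforms {
    float2 start;    // centre of the first stroke, in clip space
    float2 delta;    // per-stroke step (deltaX, deltaY), in clip space
    float2 halfSize; // half the brush quad size, in clip space
};

struct VSOut {
    float4 position [[position]];
    float2 texCoord;
    uint   brush [[flat]]; // brush index for the fragment shader to use
};

vertex VSOut brushVertex(uint vid [[vertex_id]],
                         uint iid [[instance_id]],
                         constant StrokeUniforms &u [[buffer(0)]],
                         const device uint *brushIndices [[buffer(1)]])
{
    // Unit-quad corners for a triangle strip: (0,0) (1,0) (0,1) (1,1)
    float2 corner = float2(vid & 1, vid >> 1);
    float2 centre = u.start + float(iid) * u.delta;
    VSOut out;
    out.position = float4(centre + (corner * 2.0f - 1.0f) * u.halfSize, 0.0f, 1.0f);
    out.texCoord = corner;
    out.brush = brushIndices[iid]; // per-instance brush lookup
    return out;
}
```

The whole touch event then collapses into a single draw call: [encoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4 instanceCount:strokes].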