Why is replacing these matrix transformations with a single bufferData call so much slower?

Question

I'm trying to optimize my shader that draws sprites, and I originally had something like this:

// this matrix will convert from pixels to clip space
var matrix = m3.projection(this.camera.viewportWidth / this.camera.scale, this.camera.viewportHeight / this.camera.scale);

// this matrix will translate our quad to dstX, dstY
matrix = m3.translate(matrix, dstX, dstY);

// this matrix will scale our 1 unit quad
// from 1 unit to dstWidth, dstHeight units
matrix = m3.scale(matrix, dstWidth, dstHeight);

gl.uniformMatrix3fv(attribs.matrixLocation, false, matrix);

The above code is inspired by this tutorial: https://webglfundamentals.org/webgl/lessons/webgl-2d-drawimage.html

That worked, but I already have my camera matrix transformation saved, so I wanted to avoid redoing all of those matrix transformations each frame. Each of those m3.whatever calls allocates a new array, so I thought to replace it with the following:

gl.bindBuffer(gl.ARRAY_BUFFER, attribs.positionBuffer);

attribs.positionsQuad[0] = dstX;
attribs.positionsQuad[1] = dstY + dstHeight;
attribs.positionsQuad[2] = dstX;
attribs.positionsQuad[3] = dstY;
attribs.positionsQuad[4] = dstX + dstWidth;
attribs.positionsQuad[5] = dstY + dstHeight;

attribs.positionsQuad[6] = dstX + dstWidth;
attribs.positionsQuad[7] = dstY + dstHeight;
attribs.positionsQuad[8] = dstX;
attribs.positionsQuad[9] = dstY;
attribs.positionsQuad[10] = dstX + dstWidth;
attribs.positionsQuad[11] = dstY;

gl.bufferData(gl.ARRAY_BUFFER, attribs.positionsQuad, gl.DYNAMIC_DRAW);
gl.vertexAttribPointer(attribs.positionLocation, 2, gl.FLOAT, false, 0, 0);

gl.uniformMatrix3fv(attribs.matrixLocation, false, camera.ClipTransform);

This also works, but now my frame rate is very spiky. Does anyone know why this is? I tried profiling it, and it does say that my image drawing shader is now slower, but I'm not sure how that could be. I replaced a bunch of matrix allocations and transformations with writes to a single pre-allocated array followed by one transfer, and now it's much slower?

It seems that a lot of the frame rate jumps may be due to the garbage collector running, but even this doesn't make sense to me. The initial solution should have produced far more garbage, considering it allocated and threw away a ton of arrays each frame for all those matrix transformations. Now I'm not allocating at all, so why would GC usage spike?

Is there a better way to accomplish this? I've uploaded my entire shader here for reference: https://pastebin.com/tdCYpDqv

Answer 1

Most graphics API commands are encoded into a command buffer, and at some later point (asynchronously) the graphics driver synchronizes those buffers to the GPU. For a command buffer to be predictable, all the data it refers to has to be copied into it.

One problem with your code is that you're setting the data and immediately asking the GPU to draw from it, which requires a hard sync of the complete buffer. The driver expects uniforms to need syncing, but not necessarily array buffers. The usage hints (DYNAMIC_DRAW, STREAM_DRAW and STATIC_DRAW) don't really do much about that (in most cases STATIC_DRAW is actually faster, even for dynamic data).
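Since uniforms are the path the driver expects to change, one option is to keep the original per-sprite matrix and only remove the allocations. The following is a minimal sketch, not the m3 library's API: writeSpriteMatrix is a hypothetical helper, and it assumes the m3.projection convention from the linked tutorial (pixels to clip space with the Y axis flipped) and the same column layout that m3 passes to gl.uniformMatrix3fv.

```javascript
// Hypothetical helper (not part of m3): writes the result of
// projection(viewW, viewH) * translate(dstX, dstY) * scale(dstWidth, dstHeight)
// into `out`, a preallocated Float32Array(9), without allocating anything.
function writeSpriteMatrix(out, viewW, viewH, dstX, dstY, dstWidth, dstHeight) {
  // first column: how the unit quad's x maps to clip space
  out[0] = 2 * dstWidth / viewW;
  out[1] = 0;
  out[2] = 0;
  // second column: how the unit quad's y maps to clip space (Y flipped)
  out[3] = 0;
  out[4] = -2 * dstHeight / viewH;
  out[5] = 0;
  // third column: translation of the quad's origin in clip space
  out[6] = 2 * dstX / viewW - 1;
  out[7] = 1 - 2 * dstY / viewH;
  out[8] = 1;
  return out;
}

// Allocated once, reused for every sprite.
const spriteMatrix = new Float32Array(9);
```

Per sprite, the call site would then look like gl.uniformMatrix3fv(attribs.matrixLocation, false, writeSpriteMatrix(spriteMatrix, viewW, viewH, dstX, dstY, dstWidth, dstHeight)), where viewW and viewH are the viewport size divided by the camera scale, as in the question.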

When these hard syncs happen you're almost always stalling the pipeline, meaning the GPU has to wait for all the data to be transferred before it can continue with whatever it was doing. You can avoid this by using double or even triple buffering (write the data for the next frame while rendering the current one, and so on).
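The buffering idea can be sketched as a small ring of vertex buffers, so that an upload never targets the buffer the previous draw call may still be reading. This is only an illustration under assumptions: createBufferRing is a hypothetical helper, the ring size of 3 is arbitrary, and whether it removes the stall depends on the driver.

```javascript
// Hypothetical helper: preallocates `count` buffers of `byteLength` bytes
// and rotates to the next one on every upload.
function createBufferRing(gl, byteLength, count) {
  const buffers = [];
  for (let i = 0; i < count; i++) {
    const buf = gl.createBuffer();
    gl.bindBuffer(gl.ARRAY_BUFFER, buf);
    // Allocate storage once; later uploads only overwrite its contents.
    gl.bufferData(gl.ARRAY_BUFFER, byteLength, gl.DYNAMIC_DRAW);
    buffers.push(buf);
  }
  let index = 0;
  return {
    // Binds the next buffer in the ring, copies `data` into it, and returns it.
    upload(data) {
      index = (index + 1) % count;
      const buf = buffers[index];
      gl.bindBuffer(gl.ARRAY_BUFFER, buf);
      gl.bufferSubData(gl.ARRAY_BUFFER, 0, data);
      return buf;
    },
  };
}
```

With the question's code, the ring would be created once with createBufferRing(gl, attribs.positionsQuad.byteLength, 3), and ring.upload(attribs.positionsQuad) would replace the bindBuffer and bufferData pair. The existing gl.vertexAttribPointer call has to stay after the upload, because it records whichever buffer is bound at that moment.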

With all that said, trying to optimize the draw of 6 quads is very problematic, because at this scale the differences are too small to measure meaningfully. Changing one thing or another might change the frame time, but it says nothing about scalability: you're really just measuring the (often static) overhead rather than the actual performance.
