Vulkan: strange performance of uniform buffers

Question

One of the inputs of my fragment shader is an array of 5 structures. The shader computes a color based on each of the 5 structures. In the end, these 5 colors are summed together to produce the final output. The total size of the array is 1440 bytes. To satisfy the alignment requirements of the uniform buffer, the size of the uniform buffer grows to 1920 bytes.
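As a rough check of those sizes: 1440 / 5 gives a 288-byte structure, and 1920 bytes is what results if every element is rounded up to a multiple of the device's minUniformBufferOffsetAlignment. The sketch below assumes an alignment of 128 bytes, which is an inference that happens to reproduce the numbers above, not something the question states. `align_up` and `padded_array_size` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Round `size` up to the next multiple of `alignment`.
// Vulkan requires minUniformBufferOffsetAlignment to be a power of two,
// so the bit mask below is valid for that limit.
constexpr std::uint64_t align_up(std::uint64_t size, std::uint64_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

// Total bytes needed when every element must start on an aligned offset.
constexpr std::uint64_t padded_array_size(std::uint64_t element_size,
                                          std::uint64_t count,
                                          std::uint64_t alignment) {
    return align_up(element_size, alignment) * count;
}

// Assumed values: 288-byte struct (1440 / 5) and a 128-byte alignment (hypothetical).
static_assert(align_up(288, 128) == 384, "each element is padded to 384 bytes");
static_assert(padded_array_size(288, 5, 128) == 1920, "matches the reported size");
```

In a real application the alignment would be read from VkPhysicalDeviceLimits rather than hard-coded.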

1- If I define the array of 5 structures as a uniform buffer array, the rendering takes 5 ms (measured by Nsight Graphics). The uniform buffer's memory property is 'VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT'. The uniform buffer in GLSL is defined as follows:

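// Array of uniform buffers, one struct A per buffer (A and func are not shown).
layout(set=0,binding=0) uniform UniformStruct { A a; } us[];

layout(location=0) out vec4 c;

void main()
{
    vec4 col = vec4(0);
    // Sum the contribution of each of the 5 structures.
    for (int i = 0; i < 5; i++)
        col += func(us[nonuniformEXT(i)]);
    c = col;
}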

Besides, I'm using the 'GL_EXT_nonuniform_qualifier' extension to access the uniform buffer array. This seems the most straightforward way to me, but there are alternative implementations.

2- I can split the rendering from one vkCmdDraw into five vkCmdDraw calls, change the framebuffer's blend mode from overwriting to additive, and define a single uniform buffer instead of a uniform buffer array in the fragment shader. On the CPU side, I change the descriptor type from UNIFORM_BUFFER to UNIFORM_BUFFER_DYNAMIC. Before each vkCmdDraw, I bind the dynamic uniform buffer with the corresponding offset. In the fragment shader, the for loop is removed. Although it seems that this should be slower than the first method, it is surprisingly much faster. The rendering takes only 2 ms in total for the 5 draws.
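The per-draw offsets in approach 2 can be sketched roughly as follows. This assumes the five structures sit back to back at a fixed stride that already satisfies minUniformBufferOffsetAlignment (384 bytes would fit the sizes quoted above, but that value is an assumption). `dynamic_offsets` is a hypothetical helper, and the Vulkan calls appear only in comments so the snippet compiles without the Vulkan headers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Build one dynamic offset per draw: draw i reads the structure at i * stride.
// `stride` is assumed to be a multiple of minUniformBufferOffsetAlignment.
std::vector<std::uint32_t> dynamic_offsets(std::uint32_t stride,
                                           std::uint32_t draw_count) {
    std::vector<std::uint32_t> offsets;
    offsets.reserve(draw_count);
    for (std::uint32_t i = 0; i < draw_count; ++i) {
        offsets.push_back(i * stride);
    }
    return offsets;
}

// In the recording loop, each offset would be passed on its own
// (cmd, layout, set and vertexCount are placeholders):
//   vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
//                           0, 1, &set, 1, &offsets[i]);
//   vkCmdDraw(cmd, vertexCount, 1, 0, 0);
```

With a 384-byte stride and five draws this yields the offsets 0, 384, 768, 1152 and 1536.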

3- If I define the array of 5 structures as a storage buffer and do one vkCmdDraw, the rendering takes only 1.4 ms. In other words, if I change the array from a uniform buffer array to a storage buffer but keep everything else the same as in 1, it becomes faster.

4- If I define the array of 5 structures as a global constant in the GLSL and do one vkCmdDraw, the rendering takes only 0.5 ms.

In my opinion, 4 should be the fastest way, which is true in the test. Then 1 should be next. Both 2 and 3 should be slower than 1. However, neither 2 nor 3 is slower than 1; on the contrary, they are much faster. Any ideas why using a uniform buffer array slows down the rendering? Is it because it is a host-visible buffer?

Answer 1

When it comes to UBOs, there are two kinds of hardware: the kind where UBOs are specialized hardware and the kind where they aren't. For GPUs where UBOs are not specialized hardware, a UBO is really just a readonly SSBO. You can usually tell the difference because hardware where UBOs are specialized will have different size limits on them from those of SSBOs.

For specialized hardware-based UBOs (which NVIDIA still uses, if I recall correctly), each UBO represents an upload from memory into a big block of constant data that all invocations of a particular shader stage can access.

For this kind of hardware, an array of UBOs is basically creating an array out of segments of this block of constant data. And some hardware has multiple blocks of constant data, so indexing them with non-constant expressions is tricky. This is why indexing such arrays with non-constant expressions is an optional feature of Vulkan.

By contrast, a UBO which contains an array is just one big UBO. It's special only in how big it is. Indexing through an array within a UBO is no different from indexing any array. There are no special rules with regard to the uniformity of the index of such accesses.

So stop using an array of UBOs and just use a single UBO which contains an array of data:

layout(set=0,binding=0) uniform UniformStruct { A a[5]; } us;

It'll also avoid additional padding due to alignment, additional descriptors, additional buffers, etc.
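On the host side, a single UBO holding the array can be mirrored by one plain struct, which also shows where the padding goes away. The sketch below is only illustrative: the real members of `A` are not shown in the question, so a hypothetical 288-byte `A` (18 rows of 4 floats) stands in for it, and std140 layout is assumed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the shader's struct A: 18 rows of 4 floats = 288 bytes.
// The real members are unknown; only the size is taken from the question (1440 / 5).
struct alignas(16) A {
    float rows[18][4];
};

// Mirrors: layout(set=0,binding=0) uniform UniformStruct { A a[5]; } us;
struct UniformStruct {
    A a[5];
};

// Under std140 the array stride of a struct is its size rounded up to 16 bytes,
// so a 288-byte element needs no extra padding inside a single UBO.
static_assert(sizeof(A) == 288, "element size");
static_assert(sizeof(UniformStruct) == 1440, "five elements, tightly packed");
static_assert(offsetof(UniformStruct, a) == 0, "array starts at offset 0");
```

Only the start of the whole buffer has to respect the offset alignment, so the per-element rounding that inflated the array of UBOs does not apply here.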

However, you might also speed things up by not lying to Vulkan. The expression nonuniformEXT(i) states that the expression i is not dynamically uniform. This is incorrect. Every shader invocation that executes this loop will generate i expressions that have values from 0 to 4. Every dynamic instance of the expression i for any invocation will have the same value at that place in the code as every other.

Therefore i is dynamically uniform, so telling Vulkan that it isn't is not helpful.
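A toy model may make the distinction concrete. The code below is neither Vulkan nor GLSL: it simulates a handful of invocations on the CPU and checks whether an index expression has the same value in all of them at each loop iteration, which is a simplified reading of "dynamically uniform". All names (`is_uniform`, `loop_counter_is_uniform`, `per_invocation_index_is_uniform`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// True if every simulated invocation holds the same value.
bool is_uniform(const std::vector<int>& values) {
    return std::adjacent_find(values.begin(), values.end(),
                              std::not_equal_to<int>()) == values.end();
}

// Case 1: the index is the loop counter itself, as in the question's shader.
// At iteration i, every invocation sees the value i.
bool loop_counter_is_uniform(int invocations, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::vector<int> index_per_invocation(invocations, i);
        if (!is_uniform(index_per_invocation)) return false;
    }
    return true;
}

// Case 2: the index depends on per-invocation data (here the invocation id),
// so different invocations can see different values at the same iteration.
bool per_invocation_index_is_uniform(int invocations, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::vector<int> index_per_invocation;
        for (int inv = 0; inv < invocations; ++inv) {
            index_per_invocation.push_back((i + inv) % iterations);
        }
        if (!is_uniform(index_per_invocation)) return false;
    }
    return true;
}
```

Only an index of the second kind is what nonuniformEXT is meant to flag; the loop counter in the question belongs to the first kind.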

Answer 1: 2 votes, accepted, 2020-09-19 21:30:59
