
Optimizing compute shaders

I have been doing a lot of different computations in compute shaders in OpenGL for the last couple of months. Some work fine, others are slow, some I could optimize somewhat, and others I could not optimize at all.

I have been playing around with the simple code below (gravitational forces between n particles), just to find some strategies on how to increase performance in general, but absolutely nothing works:

#version 450 core

uniform uint NumParticles;

layout (std430, binding = 0) buffer bla
{
    double rIn[];
};

layout (std430, binding = 1) writeonly buffer bla2
{
    double aOut[];
};


layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;


void main()
{
    int n;
    double dist3, dist2;
    dvec3 a, diff, r = dvec3(rIn[gl_GlobalInvocationID.x * 3 + 0], rIn[gl_GlobalInvocationID.x * 3 + 1], rIn[gl_GlobalInvocationID.x * 3 + 2]);

    a.x = a.y = a.z = 0;
    for (n = 0; n < NumParticles; n++)
    {
        if (n != gl_GlobalInvocationID.x)
        {
            diff = dvec3(rIn[n * 3 + 0], rIn[n * 3 + 1], rIn[n * 3 + 2]) - r;
            dist2 = dot(diff, diff);
            dist3 = 1.0 / (sqrt(dist2) * dist2);
            a += diff * dist3;
        }
    }
    aOut[gl_GlobalInvocationID.x * 3 + 0] = a.x;
    aOut[gl_GlobalInvocationID.x * 3 + 1] = a.y;
    aOut[gl_GlobalInvocationID.x * 3 + 2] = a.z;
}

I have the strong suspicion that it is the large number of memory accesses that slows this code down. So one thing I tried was using a shared array as a "buffer": the first thread (gl_LocalInvocationID.x == 0) reads the first (for example) 1024 particles, all threads do their calculations, then the next 1024 are loaded, etc. This slowed the code down by a factor of 2-3. Another thing I tried was putting the particle coordinates in a uniform array (which only works for up to 1024 particles, and I use a lot more - so this was just to see if it made a difference), which changed absolutely nothing.

I can provide some code for the above examples, but I don't think this would be helpful.

I know there are minor improvements one could make (like using inversesqrt instead of 1.0 / sqrt, or not computing particle n <-> particle m when m <-> n has already been computed...), but I would be interested in a general approach for compute shaders.
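For reference, the inversesqrt change would only touch the inner loop; it computes the same value, just phrased so the compiler can pick a reciprocal-square-root instruction:

            dist2 = dot(diff, diff);
            // same value as 1.0 / (sqrt(dist2) * dist2)
            dist3 = inversesqrt(dist2) / dist2;
            a += diff * dist3;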

So can anybody give me any hints for how I could improve performance for this code? I couldn't really find anything online on how to improve performance of compute shaders, so any general advice (not necessarily just for this code) would be appreciated.

This operation as defined doesn't seem like a good one for GPU parallelism. It's very hungry in terms of memory accesses, as complete processing for one particle requires reading the data for every other particle in the system.

If you want to keep the algorithm as is, you can implement it more optimally. As it stands, each work item does all of the processing for a particular particle all at once. That's a huge number of memory operations happening all at once.

Instead, split your particles into blocks, sized for a work group. Each work group operates on a block of source particles and a block of test particles (which may be the same block). The test particles should be loaded into shared memory, so each work group can repeatedly read test data quickly. So a single work group only does a portion of the tests for each block of source particles.
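A minimal sketch of that layout, assuming a block size of 128 and a 2D dispatch where gl_WorkGroupID.x selects the block of source particles and gl_WorkGroupID.y selects the block of test particles. The buffer names and the aPartial output are my own placeholders, and the question of where the partial results end up is picked up in the next paragraph:

#version 450 core

#define BLOCK_SIZE 128

uniform uint NumParticles;

layout (std430, binding = 0) readonly buffer ParticlePositions
{
    double rIn[];
};

// One partial acceleration per (source particle, test block) pair.
// A second pass reduces these into the final per-particle value.
layout (std430, binding = 1) writeonly buffer PartialAccelerations
{
    double aPartial[];
};

// Tile of test-particle positions, shared by the whole work group.
shared dvec3 tile[BLOCK_SIZE];

layout (local_size_x = BLOCK_SIZE, local_size_y = 1, local_size_z = 1) in;

void main()
{
    uint src  = gl_WorkGroupID.x * BLOCK_SIZE + gl_LocalInvocationID.x;
    uint test = gl_WorkGroupID.y * BLOCK_SIZE + gl_LocalInvocationID.x;

    // Each invocation loads one test particle into shared memory.
    if (test < NumParticles)
        tile[gl_LocalInvocationID.x] = dvec3(rIn[test * 3 + 0],
                                             rIn[test * 3 + 1],
                                             rIn[test * 3 + 2]);
    barrier();

    if (src >= NumParticles)
        return;

    dvec3 r = dvec3(rIn[src * 3 + 0], rIn[src * 3 + 1], rIn[src * 3 + 2]);
    dvec3 a = dvec3(0.0);

    // Only this work group's test block is visited here; other work groups
    // handle the remaining test blocks for the same source particle.
    uint count = min(uint(BLOCK_SIZE), NumParticles - gl_WorkGroupID.y * BLOCK_SIZE);
    for (uint n = 0; n < count; n++)
    {
        uint other = gl_WorkGroupID.y * BLOCK_SIZE + n;
        if (other != src)
        {
            dvec3 diff = tile[n] - r;
            double dist2 = dot(diff, diff);
            a += diff * (inversesqrt(dist2) / dist2);
        }
    }

    // Partial result for this (source particle, test block) pair.
    uint slot = (src * gl_NumWorkGroups.y + gl_WorkGroupID.y) * 3;
    aPartial[slot + 0] = a.x;
    aPartial[slot + 1] = a.y;
    aPartial[slot + 2] = a.z;
}

The matching dispatch would be something like glDispatchCompute((NumParticles + 127) / 128, (NumParticles + 127) / 128, 1), with aPartial sized for NumParticles * numBlocks dvec3s.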

The big difficulty now is writing the data. Since multiple work groups may be writing the added forces for the same source particles, you need to use some mechanism to either atomically accumulate into the source particle data or write the data to a temporary memory buffer. A second compute shader pass can then run over the temporary buffers and combine the data in a reduction process.
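With the temporary-buffer option, the reduction pass could look roughly like this, under the same assumed aPartial layout as the sketch above (one dvec3 per source particle per test block); NumBlocks is whatever gl_NumWorkGroups.y was in the first dispatch:

#version 450 core

uniform uint NumParticles;
uniform uint NumBlocks;   // number of test blocks used by the first pass

layout (std430, binding = 1) readonly buffer PartialAccelerations
{
    double aPartial[];
};

layout (std430, binding = 2) writeonly buffer Accelerations
{
    double aOut[];
};

layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

void main()
{
    uint i = gl_GlobalInvocationID.x;
    if (i >= NumParticles)
        return;

    // Sum the partial contributions produced for particle i by each test block.
    dvec3 a = dvec3(0.0);
    for (uint b = 0; b < NumBlocks; b++)
    {
        uint slot = (i * NumBlocks + b) * 3;
        a += dvec3(aPartial[slot + 0], aPartial[slot + 1], aPartial[slot + 2]);
    }

    aOut[i * 3 + 0] = a.x;
    aOut[i * 3 + 1] = a.y;
    aOut[i * 3 + 2] = a.z;
}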
