
Why is the performance gain of C# SIMD lower with large arrays than with tiny arrays?

I've been working on a deep learning library that I'm writing on my own. In matrix operations, getting the best performance is key for me. I've been researching programming languages and their performance on numeric operations. After a while, I found that C# SIMD has very similar performance to C++ SIMD. So, I decided to write the library in C#.

Firstly, I tested C# SIMD (I tested a lot of things, but I'm not going to write them all here). I noticed that it worked a lot better when using smaller arrays; the efficiency is not good when using bigger arrays. I think that's ridiculous: normally, things become more efficient as they get bigger.

My question is: "Why does vectorization work slower with bigger arrays in C#?"

I am going to share benchmarks (done by myself) using BenchmarkDotNet.

Program.Size = 10

| Method |      Mean |     Error |    StdDev |
|------- |----------:|----------:|----------:|
|     P1 |  28.02 ns | 0.5225 ns | 0.4888 ns |
|     P2 | 154.15 ns | 1.1220 ns | 0.9946 ns |
|     P3 | 100.88 ns | 0.8863 ns | 0.8291 ns |

Program.Size = 10000

| Method |     Mean |    Error |   StdDev |   Median |
|------- |---------:|---------:|---------:|---------:|
|     P1 | 142.0 ms | 3.065 ms | 8.989 ms | 139.5 ms |
|     P2 | 170.3 ms | 3.365 ms | 5.981 ms | 170.1 ms |
|     P3 | 103.3 ms | 2.400 ms | 2.245 ms | 102.8 ms |

So, as you can see, I increased Size by 1000x, meaning the number of elements in the arrays increased by 1,000,000x. P2 took 154 ns at first; in the second test it took 170 ms, which is roughly the 1,000,000x increase we expected. P3 also took almost exactly 1,000,000 times longer (100 ns -> 100 ms). However, what I want to highlight here is that P1, the vectorized loop, has significantly lower relative performance than before. I wonder why.

Note that P3 is independent of this topic. P1 is the vectorized version of P2, so we can say the efficiency of vectorization, in terms of the time taken, is P2/P1. My code is below:

Matrix class:

using System;
using System.Buffers;
using System.Runtime.CompilerServices;

public sealed class Matrix1
{
    public float[] Array;
    public int D1, D2;
    const int size = 110000000;
    private static ArrayPool<float> sizeAwarePool = ArrayPool<float>.Create(size, 100);

    public Matrix1(int d1, int d2)
    {
        D1 = d1;
        D2 = d2;
        if(D1*D2 > size)
        { throw new Exception("Size!"); }
        Array = sizeAwarePool.Rent(D1 * D2);
    }

    bool Deleted = false;
    public void Dispose()
    {
        sizeAwarePool.Return(Array);
        Deleted = true;
    }

    ~Matrix1()
    {
        if(!Deleted)
        {
            throw new Exception("Error!");
        }
    }

    public float this[int x, int y]
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get
        {
            return Array[x * D2 + y];
        }
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        set
        {
            Array[x * D2 + y] = value;
        }
    }
}

Program class:

using System.Numerics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    const int Size = 10000;

    [Benchmark]
    public void P1()
    {
        Matrix1 a = Program.a, b = Program.b, c = Program.c;
        int sz = Vector<float>.Count;
        for (int i = 0; i < Size * Size; i += sz)
        {
            var v1 = new Vector<float>(a.Array, i);
            var v2 = new Vector<float>(b.Array, i);
            var v3 = v1 + v2;
            v3.CopyTo(c.Array, i);
        }

    }

    [Benchmark]
    public void P2()
    {
        Matrix1 a = Program.a, b = Program.b, c = Program.c;
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                c[i, j] = a[i, j] + b[i, j];
    }
    [Benchmark]
    public void P3()
    {
        Matrix1 a = Program.a;
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                a[i, j] = i + j - j; 
                //could have written a.Array[i*size + j] = i + j
                //but it would have made no difference in terms of performance.
                //so leave it that way
    }


    public static Matrix1 a = new Matrix1(Size, Size);
    public static Matrix1 b = new Matrix1(Size, Size);
    public static Matrix1 c = new Matrix1(Size, Size);

    static void Main(string[] args)
    {
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                a[i, j] = i;
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                b[i, j] = j;
        for (int i = 0; i < Size; i++)  
            for (int j = 0; j < Size; j++)
                c[i, j] = 0;

        var summary = BenchmarkRunner.Run<Program>();
        a.Dispose();
        b.Dispose();
        c.Dispose();
    }
}     

I assure you that x[i,j] doesn't affect the performance; it's the same as using x.Array[i*Size + j].

This might not be the whole story: the OP reports in comments that they sped up P1 from 140 ms to 120 ms with jagged arrays.

So maybe something extra is holding it back in the large case. I'd use performance counters to investigate and check for ld_blocks_partial.address_alias (4k aliasing -> false dependency of loads on stores). And/or look at the memory addresses you get from the C# allocators and see whether they're close to, but not quite, the same alignment relative to a 4k boundary.
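As a quick sanity check for that last point, here is a minimal sketch (my own addition, not part of the question) that prints where each of the three pooled arrays sits relative to a 4k boundary. If all three print the same low 12 bits, 4k aliasing between the load and store streams becomes a plausible explanation. It assumes the project is compiled with unsafe code enabled.

using System;

public static class AlignmentCheck
{
    // Prints each buffer's offset within its 4 KiB page. Identical offsets for
    // a, b and c mean the same index in all three arrays hits the same L1d set
    // and can trigger 4k-aliasing false dependencies.
    public static unsafe void PrintPageOffsets(float[] a, float[] b, float[] c)
    {
        fixed (float* pa = a, pb = b, pc = c)
        {
            Console.WriteLine($"a: 0x{(ulong)pa % 4096:X3}");
            Console.WriteLine($"b: 0x{(ulong)pb % 4096:X3}");
            Console.WriteLine($"c: 0x{(ulong)pc % 4096:X3}");
        }
    }
}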

I don't think needing 3 hot cache lines in the same set would be a problem; L1d is 8-way associative on any CPU that would give >4x speedups with AVX (i.e. with 256-bit load/store and ALUs). But if all your arrays have the same alignment relative to a 4k boundary, they will all alias the same set in a 32 kiB L1d cache when you access the same index. (32 kiB / 8 ways / 64-byte lines = 64 sets, so addresses that are a multiple of 4 kiB apart map to the same set.)

Oh, here's a theory: jagged arrays stagger the page walks, instead of all 3 streams (2 src, 1 dst) reaching a new page at the same time and all having a TLB miss that requires a walk. Try making sure your code uses 2M hugepages instead of just 4k pages to reduce TLB misses. (e.g. on Linux you'd use a madvise(buf, size, MADV_HUGEPAGE) system call.)
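For reference, a hedged sketch of what that could look like from C# on Linux. It P/Invokes the real libc madvise() on a 2 MiB-aligned native buffer from NativeMemory.AlignedAlloc (.NET 6+); MADV_HUGEPAGE is 14 in Linux's <sys/mman.h>. This only applies if you back the matrices with such a native buffer rather than with ArrayPool arrays on the GC heap, so treat it as an illustration, not a drop-in change to the question's code.

using System;
using System.Runtime.InteropServices;

static class HugePages
{
    [DllImport("libc", SetLastError = true)]
    private static extern int madvise(IntPtr addr, UIntPtr length, int advice);

    private const int MADV_HUGEPAGE = 14;   // from <sys/mman.h> on Linux

    // Allocates 'count' floats in native memory, 2 MiB-aligned, and asks the
    // kernel to back the range with transparent hugepages. Caller frees the
    // buffer with NativeMemory.AlignedFree.
    public static unsafe float* AllocHuge(nuint count)
    {
        nuint bytes = count * (nuint)sizeof(float);
        float* p = (float*)NativeMemory.AlignedAlloc(bytes, 2 * 1024 * 1024);
        if (madvise((IntPtr)p, bytes, MADV_HUGEPAGE) != 0)
            Console.WriteLine("madvise failed: " + Marshal.GetLastWin32Error());
        return p;
    }
}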

Check performance counter events for dtlb_load_misses.miss_causes_a_walk and/or dtlb_load_misses.stlb_hit. There is TLB prefetch, so having the streams staggered can allow TLB prefetch to work on one or two of them in parallel instead of getting hit with all 3 page walks at once.


Large sizes bottleneck on memory bandwidth, not just ALU

SIMD doesn't increase available memory bandwidth, just how quickly you can get data in/out of cache. It increases how much of the memory bandwidth you can actually use most of the time. Doing the same work in fewer instructions can help out-of-order exec see farther ahead and detect TLB misses sooner, though.

The speedup with large arrays is limited because scalar code is already close-ish to bottlenecked on main-memory bandwidth. Your C[i] = A[i] + B[i] access pattern is the STREAM sum access pattern: maximal memory access for one ALU operation. (1D vs. 2D indexing is irrelevant; you're still just reading/writing contiguous memory and doing pure vertical SIMD float addition, explicitly so in the P1 case.)
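To put a rough number on that (my own back-of-the-envelope arithmetic, assuming Size = 10000 and the ~142 ms P1 result from the table above):

using System;

// One P1 call streams 3 arrays of 10000*10000 floats: read a, read b, write c.
double elements   = 10000.0 * 10000.0;
double bytesNoRfo = elements * 4 * 3;   // ~1.2 GB of traffic
double bytesRfo   = elements * 4 * 4;   // ~1.6 GB if the store also triggers a read-for-ownership
double seconds    = 0.142;              // measured P1 time for Size = 10000
Console.WriteLine($"~{bytesNoRfo / seconds / 1e9:F1} GB/s");   // ~8.5 GB/s
Console.WriteLine($"~{bytesRfo / seconds / 1e9:F1} GB/s");     // ~11.3 GB/s
// Either figure is in the ballpark of single-core DRAM bandwidth, consistent
// with the loop being memory-bound rather than ALU-bound at this size.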

With small matrices (10x10 = 100 floats = 400 bytes * (2 sources + 1 dst) = 1.2 kB), your data can stay hot in the L1d cache, so cache misses won't bottleneck your SIMD loop.

With your src + dst hot in L1d cache, you can get close to the full 8x speedup over scalar with AVX's 8x 32-bit elements per vector, assuming a Haswell or later CPU with a peak load+store throughput of 2x 32-byte vector loads + 1x 32-byte vector store per clock cycle.

In practice you got 154.15 / 28.02 = ~5.5 for the small-matrix case.

Actual cache limitations apparently preclude that: e.g. Intel's optimization manual lists ~81 bytes / clock cycle of typical sustained load + store bandwidth for Skylake's L1d cache. But with GP-integer loads + stores, Skylake can sustain 2 loads + 1 store per cycle for 32-bit operand-size, with the right loop. So there's some kind of microarchitectural limit, other than load/store uop throughput, that slows down vector load/store somewhat.


You didn't say what hardware you have, but I'm guessing it's Intel Haswell or later. "Only" 5.5x speedup might be due to benchmark overhead for only doing 12 or 13 loop iterations per call.

(100 elements / 8 elements per vector = 12.5. So 12 iterations if you leave the last 4 elements undone, or 13 if you over-read by 4 because your loop condition isn't i < Size * Size - sz + 1.)
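For completeness, a hedged sketch (my own, not the OP's code) of a P1 variant that only issues full vectors and finishes the leftover elements with a scalar tail, so it neither over-reads past the end nor leaves a remainder undone. It reuses the question's Matrix1 type and would be called as VectorAdd.Run(a, b, c, Size * Size):

using System.Numerics;

public static class VectorAdd
{
    public static void Run(Matrix1 a, Matrix1 b, Matrix1 c, int n)
    {
        int sz = Vector<float>.Count;
        int i = 0;
        for (; i <= n - sz; i += sz)              // last full vector starts at index n - sz
        {
            var v1 = new Vector<float>(a.Array, i);
            var v2 = new Vector<float>(b.Array, i);
            (v1 + v2).CopyTo(c.Array, i);
        }
        for (; i < n; i++)                        // scalar tail: the remaining n % sz elements
            c.Array[i] = a.Array[i] + b.Array[i];
    }
}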

Zen's 2x 16-byte memory ops per clock (up to one of which can be a store) would slow down both scalar and AVX equally. But you'd still get at best 4x speedup going from 1 element per vector with movss / addss xmm, mem / movss to the same uops doing 4 elements at once. Using 256-bit instructions on Zen 1 just means 2 uops per instruction, with the same 2-memory-uops-per-clock throughput limit. You get better front-end throughput from using 2-uop instructions, but that's not the bottleneck here. (Assuming the compiler can make a loop in 5 uops or fewer, it can issue at 1 iteration per clock, and it couldn't even run that fast because of the back-end bottleneck on the load/store ports.)

Those results would also make sense on Zen 2, I think: 256-bit SIMD execution units, and I think also load/store ports, mean that you can expect up to 8x speedups when doing 8x the amount of work per instruction.
