Code alignment dramatically affects performance

Question

Today I found sample code that slowed down by 50% after I added some unrelated code. After debugging, I figured out that the problem was the loop alignment. Depending on where the loop code is placed, the execution time differs, e.g.:

Address	Time [us]
00007FF780A01270	980
00007FF7750B1280	1500
00007FF7750B1290	986
00007FF7750B12A0	1500

I did not previously expect that code alignment could have such a big impact, and I thought my compiler was smart enough to align the code correctly.

What exactly causes such a big difference in execution time? (I suppose some processor architecture details.)

I compiled the test program in Release mode with Visual Studio 2019 and ran it on Windows 10. I checked the program on two processors: an i7-8700k (the results above) and an Intel i5-3570k, where the problem does not occur and the execution time is always about 1250us. I also tried compiling the program with clang, but with clang the result is always ~1500us (on the i7-8700k).

My test program:

#include <chrono>
#include <cstring>   // memset
#include <iostream>
#include <intrin.h>  // __nop (MSVC intrinsic)
using namespace std;

template<int N>
__forceinline void noops()
{
    __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop();
    noops<N - 1>();
}
template<>
__forceinline void noops<0>(){}

template<int OFFSET>
__declspec(noinline) void SumHorizontalLine(const unsigned char* __restrict src, int width, int a, unsigned short* __restrict dst)
{
    unsigned short sum = 0;
    const unsigned char* srcP1 = src - a - 1;
    const unsigned char* srcP2 = src + a;

    //some dummy loop,just a few iterations
    for (int i = 0; i < a; ++i)
        dst[i] = src[i] / (double)dst[i];

    noops<OFFSET>();
    //the important loop
    for (int x = a + 1; x < width - a; x++)
    {
        unsigned char v1 = srcP1[x];
        unsigned char v2 = srcP2[x];
        sum -= v1;
        sum += v2;
        dst[x] = sum;
    }

}

template<int OFFSET>
void RunTest(unsigned char* __restrict src, int width, int a, unsigned short* __restrict dst)
{
    double minTime = 99999999;
    for(int i = 0; i < 20; ++i)
    {
        auto start = chrono::steady_clock::now();

        for (int i = 0; i < 1024; ++i)
        {
            SumHorizontalLine<OFFSET>(src, width, a, dst);
        }

        auto end = chrono::steady_clock::now();
        auto us = chrono::duration_cast<chrono::microseconds>(end - start).count();
        if (us < minTime)
        {
            minTime = us;
        }
    }

    cout << OFFSET << " : " << minTime << " us" << endl;
}

int main()
{
    const int width = 2048;
    const int x = 3;
    unsigned char* src = new unsigned char[width * 5];
    unsigned short* dst = new unsigned short[width];
    memset(src, 0, sizeof(unsigned char) * width);
    memset(dst, 0, sizeof(unsigned short) * width);

    while(true)
    RunTest<1>(src, width, x, dst);
}

To test a different alignment, just recompile the program with RunTest<0> changed to RunTest<1>, etc. The compiler always aligns the code to 16 bytes. In my test code I just insert additional nops to move the code a bit further.

Assembly code generated for the loop with OFFSET=1 (for other offsets only the number of npads differs):

  0007c 90       npad    1
  0007d 90       npad    1
  0007e 49 83 c1 08  add     r9, 8
  00082 90       npad    1
  00083 90       npad    1
  00084 90       npad    1
  00085 90       npad    1
  00086 90       npad    1
  00087 90       npad    1
  00088 90       npad    1
  00089 90       npad    1
  0008a 90       npad    1
  0008b 90       npad    1
  0008c 90       npad    1
  0008d 90       npad    1
  0008e 90       npad    1
  0008f 90       npad    1
$LL15@SumHorizon:

; 25   : 
; 26   :    noops<OFFSET>();
; 27   : 
; 28   :    for (int x = a + 1; x < width - a; x++)
; 29   :    {
; 30   :        unsigned char v1 = srcP1[x];
; 31   :        unsigned char v2 = srcP2[x];
; 32   :        sum -= v1;

  00090 0f b6 42 f9  movzx   eax, BYTE PTR [rdx-7]
  00094 4d 8d 49 02  lea     r9, QWORD PTR [r9+2]

; 33   :        sum += v2;

  00098 0f b6 0a     movzx   ecx, BYTE PTR [rdx]
  0009b 48 8d 52 01  lea     rdx, QWORD PTR [rdx+1]
  0009f 66 2b c8     sub     cx, ax
  000a2 66 44 03 c1  add     r8w, cx

; 34   :        dst[x] = sum;

  000a6 66 45 89 41 fe   mov     WORD PTR [r9-2], r8w
  000ab 49 83 ea 01  sub     r10, 1
  000af 75 df        jne     SHORT $LL15@SumHorizon

; 35   :    }
; 36   : 
; 37   : }

  000b1 c3       ret     0
??$SumHorizontalLine@$00@@YAXPEIBEHHPEIAG@Z ENDP    ; SumHorizont

Answer 1

In the slow cases (i.e., 00007FF7750B1280 and 00007FF7750B12A0), the jne instruction crosses a 32-byte boundary. The mitigations for the "Jump Conditional Code" (JCC) erratum ( https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf ) prevent such instructions from being cached in the DSB (the decoded uop cache). The JCC erratum only applies to Skylake-based CPUs, which is why the effect does not occur on your i5-3570k CPU.
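The boundary check can be reproduced with a few lines of arithmetic. The following is only a sketch: it assumes the addresses in the question's table are the start of the loop, and it takes the position of jne from the listing (loop label at 0x90, jne at 0xaf, 2 bytes long). The names kBoundary, TouchesBoundary and LoopJneAffected are made up for this illustration.

```cpp
#include <cstdint>

// Window size used by the JCC erratum mitigation, per the Intel document.
constexpr std::uint64_t kBoundary = 32;

// True if the instruction occupying [addr, addr + length) crosses a 32-byte
// boundary or ends exactly on one.
constexpr bool TouchesBoundary(std::uint64_t addr, std::uint64_t length)
{
    return (addr / kBoundary != (addr + length - 1) / kBoundary)
        || ((addr + length) % kBoundary == 0);
}

// In the question's listing, jne starts 0x1f bytes after the loop label
// (0xaf - 0x90) and is 2 bytes long (75 df).
constexpr bool LoopJneAffected(std::uint64_t loopStart)
{
    return TouchesBoundary(loopStart + 0x1f, 2);
}
```

Under these assumptions, LoopJneAffected is true for the loop starts ending in 1280 and 12A0 and false for those ending in 1270 and 1290, which matches the slow and fast rows of the table.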

As Peter Cordes pointed out in a comment, recent compilers have options that try to mitigate this effect. "Intel JCC Erratum - should JCC really be treated separately?" mentions MSVC's /QIntel-jcc-erratum option; another related question is "How can I mitigate the impact of the Intel jcc erratum on gcc?"

Answer 2

"I thought my compiler is smart enough to align the code correctly."

As you said, the compiler always aligns things to a multiple of 16 bytes. This probably does account for the direct effects of alignment. But there are limits to the "smartness" of the compiler.

Besides alignment, code placement also has indirect performance effects because of cache associativity. If there is too much contention for the few cache lines that can map to this address, performance will suffer. Moving to an address with less contention makes the problem go away.
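To make "the few cache lines that can map to this address" concrete, here is a rough sketch of how an address selects a set in a set-associative cache. The geometry (32 KiB, 8-way, 64-byte lines) is an assumption chosen for illustration, not a measured property of either CPU in the question, and CacheSet and SameSet are hypothetical helper names.

```cpp
#include <cstdint>

// Assumed cache geometry, for illustration only.
constexpr std::uint64_t kLineSize  = 64;         // bytes per cache line
constexpr std::uint64_t kWays      = 8;          // lines per set
constexpr std::uint64_t kCacheSize = 32 * 1024;  // total bytes
constexpr std::uint64_t kSets = kCacheSize / (kLineSize * kWays);

// Index of the set an address maps to: line number modulo number of sets.
constexpr std::uint64_t CacheSet(std::uint64_t addr)
{
    return (addr / kLineSize) % kSets;
}

// Two addresses compete for the same kWays slots when their sets are equal.
constexpr bool SameSet(std::uint64_t a, std::uint64_t b)
{
    return CacheSet(a) == CacheSet(b);
}
```

With this geometry, addresses that are a multiple of kSets * kLineSize (4096 bytes) apart land in the same set, so only kWays such lines can be cached at once; shifting the code by one cache line moves it to a neighbouring set.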

The compiler may be smart enough to handle cache contention effects as well, but only if you turn on profile-guided optimization (PGO). The interactions are far too complex to predict with a reasonable amount of work; it is much easier to watch for cache conflicts by actually running the program, and that is what PGO does.
