
Writing a strided x86 benchmark

I'd like to write a load benchmark that strides across a given region of memory with a compile-time-known stride, wrapping at the end of the (power-of-2) region, with as few non-load instructions as possible.

For example, given a stride of 4099, the iteration count in rdi, and the pointer to the memory region in rsi, something that "works" is:

%define STRIDE 4099
%define SIZE   128 * 1024
%define MASK   (SIZE - 1)
xor     ecx, ecx

.top:
mov     al, [rsi + rcx]
add     ecx, STRIDE
and     ecx, MASK
dec     rdi
jnz     .top

The problem is that there are 4 non-load instructions just to support the single load, dealing with the stride addition, masking and loop termination check. Also, there is a 2-cycle dependency chain carried through ecx.
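
As a sketch of what the loop computes (my addition, not part of the benchmark itself), the offset sequence can be modelled in Python with the same constants:

```python
STRIDE = 4099
SIZE = 128 * 1024
MASK = SIZE - 1

def offsets(iterations):
    """Byte offsets the loop loads from: ecx starts at 0,
    then ecx = (ecx + STRIDE) & MASK after each load."""
    ecx, out = 0, []
    for _ in range(iterations):
        out.append(ecx)
        ecx = (ecx + STRIDE) & MASK
    return out

first = offsets(5)    # [0, 4099, 8198, 12297, 16396]
# STRIDE is odd, so the pattern touches every byte offset before repeating:
assert len(set(offsets(SIZE))) == SIZE
```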

We can unroll this a bit to reduce the loop termination check cost to near zero, and break up the dependency chain (here unrolled by 4x):

.top:

lea     edx, [ecx + STRIDE * 0]
and     edx, MASK
movzx   eax, BYTE [rsi + rdx]

lea     edx, [ecx + STRIDE * 1]
and     edx, MASK
movzx   eax, BYTE [rsi + rdx]

lea     edx, [ecx + STRIDE * 2]
and     edx, MASK
movzx   eax, BYTE [rsi + rdx]

lea     edx, [ecx + STRIDE * 3]
and     edx, MASK
movzx   eax, BYTE [rsi + rdx]

add     ecx, STRIDE * 4

dec rdi
jnz .top

However, this doesn't help with the add and and operations dealing with wrapping the stride. For example, the above benchmark reports 0.875 cycles/load for an L1-contained region, but we know the right answer is 0.5 (two loads per cycle). The 0.875 comes from 15 total uops / 4 uops per cycle, i.e., we are bound by the 4-wide maximum uop throughput, not by load throughput.

Any ideas on a good way to effectively unroll the loop to remove most of the cost of the stride calculation?

For "absolute maximum insanity": you can ask the OS to map the same pages at many virtual addresses (e.g. so the same 16 MiB of RAM appears at virtual addresses 0x10000000, 0x11000000, 0x12000000, 0x13000000, ...) to avoid the need to care about wrapping, and you can use self-generating code to avoid everything else. Basically, code that generates instructions that look like:

    movzx eax, BYTE [address1]
    movzx ebx, BYTE [address2]
    movzx ecx, BYTE [address3]
    movzx edx, BYTE [address4]
    movzx esi, BYTE [address5]
    movzx edi, BYTE [address6]
    movzx ebp, BYTE [address7]
    movzx eax, BYTE [address8]
    movzx ebx, BYTE [address9]
    movzx ecx, BYTE [address10]
    movzx edx, BYTE [address11]
    ...
    movzx edx, BYTE [address998]
    movzx esi, BYTE [address999]
    ret

Of course all of the addresses used would be calculated by the code that generates the instructions.

Note: Depending on which specific CPU it is, it may be faster to have a loop rather than completely unrolling (there's a compromise between instruction fetch/decode costs and loop overhead). For more recent Intel CPUs there's a thing called the loop stream detector, designed to avoid fetch and decode for loops smaller than a certain size (where that size depends on the CPU model); I'd assume generating a loop that fits within that size would be optimal.

About that math, a proof: at the beginning of the unrolled loop, if ecx < STRIDE, n = (SIZE div STRIDE), and SIZE is not divisible by STRIDE, then (n-1)*STRIDE < SIZE, i.e. n-1 iterations are safe without masking. The n-th iteration may or may not need masking (depending on the initial ecx). If the n-th iteration did not need the mask, the (n+1)-th will need it.
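
Those bounds can be checked numerically (a Python sketch of my own, using the question's constants; the entry values 0 and STRIDE-1 are the two extremes of ecx < STRIDE):

```python
STRIDE, SIZE = 4099, 128 * 1024
n = SIZE // STRIDE                 # SIZE div STRIDE = 31
assert SIZE % STRIDE != 0          # SIZE is not divisible by STRIDE

# with ecx < STRIDE at entry, the first n-1 adds can never reach SIZE:
worst_entry = STRIDE - 1
assert worst_entry + (n - 1) * STRIDE < SIZE

# the n-th add may or may not cross SIZE, depending on the entry value:
assert 0 + n * STRIDE < SIZE               # low entry: still in bounds
assert worst_entry + n * STRIDE >= SIZE    # high entry: needs masking

# if the n-th add did not cross SIZE, the (n+1)-th must:
assert 0 + (n + 1) * STRIDE >= SIZE
```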

The consequence is that you can design code like this:

    xor    ecx, ecx
    jmp    loop_entry
unrolled_loop:
    and    ecx, MASK     ; make ecx < STRIDE again
    jz     terminate_loop
loop_entry:
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE
    ; ... (SIZE div STRIDE)-1 times
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE

    ;after (SIZE div STRIDE)-th add ecx,STRIDE
    cmp    ecx, SIZE
    jae    unrolled_loop
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE
    ; assert( ecx >= SIZE )
    jmp    unrolled_loop

terminate_loop:

The number of adds happening before an and is needed is not regular; it will be n or n+1, so the end of the unrolled loop has to be doubled, to start each pass of the unrolled loop with ecx < STRIDE.
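
To sanity-check that control flow, here is a Python model (my sketch, using the question's constants) that mimics the doubled exit and verifies that the loads it issues are exactly the masked sequence from the original loop:

```python
STRIDE, SIZE = 4099, 128 * 1024    # constants from the question
MASK = SIZE - 1
n = SIZE // STRIDE                 # 31

offsets, ecx = [], 0
while True:
    # loop_entry: n load+add pairs (through the (SIZE div STRIDE)-th add)
    for _ in range(n):
        offsets.append(ecx)
        ecx += STRIDE
    # doubled exit: one extra pair when the n-th add didn't cross SIZE
    if ecx < SIZE:
        offsets.append(ecx)
        ecx += STRIDE
    assert ecx >= SIZE
    ecx &= MASK                    # unrolled_loop: make ecx < STRIDE again
    if ecx == 0:
        break                      # jz terminate_loop

# every load stays in bounds, and the order matches (k*STRIDE) mod SIZE
assert all(0 <= off < SIZE for off in offsets)
assert offsets == [(k * STRIDE) % SIZE for k in range(len(offsets))]
```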

I'm not good enough with nasm macros to decide whether this can be unrolled by some kind of macro magic.

There's also the question of whether this can be macro-ed to use different registers, like:

    xor    ecx, ecx

    ...
loop_entry:
    lea    rdx,[rcx + STRIDE*4]  
    movzx  eax, BYTE [rsi + rcx]
    movzx  eax, BYTE [rsi + rcx + STRIDE]
    movzx  eax, BYTE [rsi + rcx + STRIDE*2]
    movzx  eax, BYTE [rsi + rcx + STRIDE*3]
    add    ecx, STRIDE*8
    movzx  eax, BYTE [rsi + rdx]
    movzx  eax, BYTE [rsi + rdx + STRIDE]
    movzx  eax, BYTE [rsi + rdx + STRIDE*2]
    movzx  eax, BYTE [rsi + rdx + STRIDE*3]
    add    edx, STRIDE*8
    ...

    ; then the final part can be filled with simple
    movzx  eax, BYTE [rsi + rcx]
    add    ecx, STRIDE
    ; ... until the n-th add state is reached, then jae to the loop final

    ;after (SIZE div STRIDE)-th add ecx,STRIDE
    cmp    ecx, SIZE
    jae    unrolled_loop
    movzx  eax, BYTE [rsi + rcx]
    add    ecx, STRIDE
    ; assert( ecx >= SIZE )
    jmp    unrolled_loop

Also, the inner "safe" part can be looped some number of times. For example, if SIZE div STRIDE = 31.97657965357404 as in your example, then the inner 8-load block can be looped 3 times (3*8 = 24), followed by 7 of the simple non-and load+add lines to reach the 31st add; then the doubled loop exit follows to eventually reach the 32nd add as needed.

Although in the case of your ~31.9 it looks pointless, looping over the middle part would make sense for something like SIZE div STRIDE in the hundreds or more.
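
The counts above can be checked quickly (a Python sketch of the arithmetic, my addition, using the question's constants):

```python
STRIDE, SIZE = 4099, 128 * 1024
n = SIZE // STRIDE                 # SIZE div STRIDE, rounded down

assert abs(SIZE / STRIDE - 31.97657965357404) < 1e-9
assert n == 31
# three passes over the 8-load block give 24 adds; 7 more single
# load+add lines reach the 31st add; the doubled loop exit then
# supplies the 32nd add when it is needed.
assert 3 * 8 + 7 == n
```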

If you use AVX2 gather to generate the load uops, you can use SIMD for the add + AND. This probably isn't what you want when trying to measure anything about non-gather loads, though!

If your region size is 2^16..2^19, you can use add ax, dx (with DX = stride, to avoid LCP stalls) to get wrapping at 2^16 for free. Use eax as a scaled index. With lea di, [eax + STRIDE * n] and so on in an unrolled loop, this could save enough uops to let you run 2 loads per clock without bottlenecking on the front-end. But the partial-register merging dependencies (on Skylake) would create multiple loop-carried dep chains, and you'd run out of registers in 32-bit mode if you need to avoid reusing them.
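
To illustrate the trick (a Python model of my own, not the asm itself): addition into a 16-bit register is implicitly modulo 2^16, so the wrap that previously needed an explicit and comes for free from the register width.

```python
STRIDE = 4099      # the question's stride; held in dx in the asm version

def add_ax(ax):
    """Model of 'add ax, dx': the 16-bit destination wraps at 2^16 for free
    (the & 0xFFFF here stands for the register width, not an instruction)."""
    return (ax + STRIDE) & 0xFFFF

ax = 0xFFF0                      # near the top of the 64 KiB window
assert add_ax(ax) == 0x0FF3      # 0xFFF0 + 4099 = 0x10FF3, truncated to 16 bits
```

For region sizes above 2^16, scaling the index (addressing like [eax*2], [eax*4], [eax*8]) stretches the 64 Ki distinct index values over 2^17..2^19 bytes, which is presumably what "use eax as a scaled index" refers to.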

You could even consider mapping the low 64k of virtual memory (on Linux, set vm.mmap_min_addr=0) and using 16-bit addressing modes in 32-bit code. Reading only 16-bit registers avoids the complications of having to write only 16 bits; it's fine to end up with garbage in the upper 16.


To do better without 16-bit addressing modes, you need to create conditions where you know wrapping can't happen. This allows unrolling with [reg + STRIDE * n] addressing modes.

You could write a normal unrolled loop that breaks out when approaching the wrap-around point (i.e. when ecx + STRIDE*n > bufsize), but that won't predict well if bufsize / STRIDE is greater than about 22 on Skylake.

You could simply do the AND masking only once per iteration, and relax the constraint that the working set is exactly 2^n bytes. i.e. if you reserve enough space for your loads to go beyond the end by up to STRIDE * n - 1, and you're ok with that cache behaviour, then just do it.
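
A quick Python model of that idea (my sketch, using the question's constants and an assumed unroll factor of 8): one and per unrolled iteration, plain [reg + STRIDE*k] offsets for the loads, and the furthest load stays within the reserved slack past the 2^n end.

```python
STRIDE, SIZE, UNROLL = 4099, 128 * 1024, 8
MASK = SIZE - 1

ecx, max_off = 0, 0
for _ in range(50_000):            # enough iterations to cycle all index values
    ecx &= MASK                    # one AND per unrolled iteration
    for k in range(UNROLL):        # the unrolled loads: [rsi + rcx + STRIDE*k]
        max_off = max(max_off, ecx + STRIDE * k)
    ecx += STRIDE * UNROLL

assert max_off < SIZE + STRIDE * UNROLL    # never past the reserved slack
assert max_off >= SIZE                     # but loads do run past the 2^n end
```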

If you choose your unroll factor carefully, you can maybe control where the wraparound will happen every time. But with a prime stride and a power-of-2 buffer size, I think you'd need an unroll of lcm(stride, bufsize/stride) = stride * bufsize/stride = bufsize for the pattern to repeat. For buffer sizes that don't fit in L1, this unroll factor is too large to fit in the uop cache, or even L1I. I looked at a couple of small test cases, like n*7 % 16, which repeats after 16 iterations, just like n*5 % 16 and n*3 % 16. And n*7 % 32 repeats after 32 iterations. i.e. the linear congruential generator explores every value less than the modulus when the multiplier and modulus are relatively prime.
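
Those repeat lengths check out numerically (a small Python sketch of mine; period() counts how many steps n*stride % modulus takes to return to 0):

```python
from math import gcd

def period(stride, modulus):
    """Steps until the sequence n*stride % modulus returns to 0."""
    value, steps = stride % modulus, 1
    while value != 0:
        value = (value + stride) % modulus
        steps += 1
    return steps

assert period(7, 16) == 16 and period(5, 16) == 16 and period(3, 16) == 16
assert period(7, 32) == 32
# in general the period is modulus // gcd(stride, modulus), so a prime
# stride with a power-of-2 size forces the full-buffer repeat length:
p = period(4099, 128 * 1024)
assert p == 128 * 1024
assert p == (128 * 1024) // gcd(4099, 128 * 1024)
```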

None of these options are ideal, but that's the best I can suggest for now.
