
Writing a strided x86 benchmark

I'd like to write a load benchmark that strides across a given region of memory with a compile-time known stride, wrapping at the end of the region (power-of-2) with as few non-load instructions as possible.

For example, given a stride of 4099, with the iteration count in rdi and the pointer to the memory region in rsi, something that "works" is:

%define STRIDE 4099
%define SIZE   128 * 1024
%define MASK   (SIZE - 1)
xor     ecx, ecx

.top:
mov     al, [rsi + rcx]
add     ecx, STRIDE
and     ecx, MASK
dec     rdi
jnz     .top

The problem is that there are 4 non-load instructions just to support the single load, dealing with the stride addition, masking and loop termination check. Also, there is a 2-cycle dependency chain carried through ecx.

We can unroll this a bit to reduce the loop termination check cost to near zero, and break up the dependency chain (here unrolled by 4x):

.top:

lea     edx, [ecx + STRIDE * 0]
and     edx, MASK
movzx   eax, BYTE [rsi + rdx]

lea     edx, [ecx + STRIDE * 1]
and     edx, MASK
movzx   eax, BYTE [rsi + rdx]

lea     edx, [ecx + STRIDE * 2]
and     edx, MASK
movzx   eax, BYTE [rsi + rdx]

lea     edx, [ecx + STRIDE * 3]
and     edx, MASK
movzx   eax, BYTE [rsi + rdx]

add     ecx, STRIDE * 4

dec     rdi
jnz     .top

However, this doesn't help with the add and and operations that deal with wrapping the stride. For example, the above benchmark reports 0.875 cycles/load for an L1-contained region, but we know the right answer is 0.5 (two loads per cycle). The 0.875 comes from the 15 instructions per iteration (14 fused-domain uops, since dec/jnz macro-fuse): 14 uops at 4 uops/cycle is 3.5 cycles for 4 loads, i.e., we are bound by the 4-wide maximum uop throughput, not by load throughput.

Any ideas on a good way to effectively unroll the loop so as to remove most of the cost of the stride calculation?

For "absolute maximum insanity"; you can ask the OS to map the same pages at many virtual addresses (eg so the same 16 MiB of RAM appears at virtual addresses 0x100000000, 0x11000000, 0x12000000, 0x13000000, ...) to avoid the need to care about wrapping; and you can use self-generating code to avoid everything else. Basically, code that generates instructions that look like:

    movzx eax, BYTE [address1]
    movzx ebx, BYTE [address2]
    movzx ecx, BYTE [address3]
    movzx edx, BYTE [address4]
    movzx esi, BYTE [address5]
    movzx edi, BYTE [address6]
    movzx ebp, BYTE [address7]
    movzx eax, BYTE [address8]
    movzx ebx, BYTE [address9]
    movzx ecx, BYTE [address10]
    movzx edx, BYTE [address11]
    ...
    movzx edx, BYTE [address998]
    movzx esi, BYTE [address999]
    ret

Of course all of the addresses used would be calculated by the code that generates the instructions.
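As an aside (my own illustration, not part of the answer): if the aliases are mapped at fixed, assembly-time-known addresses (e.g. with MAP_FIXED), the same flat load sequence could even be produced at assembly time with a NASM %rep instead of a runtime generator. In the sketch below, BASE, NLOADS and the back-to-back alias layout are all illustrative assumptions:

    ; Assumes the SIZE-byte region has been mapped repeatedly at BASE, BASE+SIZE,
    ; BASE+2*SIZE, ... covering at least NLOADS*STRIDE bytes, so the raw offset
    ; never needs wrapping.
    %define STRIDE  4099
    %define SIZE    (128 * 1024)
    %define BASE    0x10000000        ; hypothetical first alias of the region
    %define NLOADS  1024              ; how many loads to emit

    bits 64
    strided_loads:
    %assign off 0
    %rep NLOADS
        movzx eax, byte [abs BASE + off]
    %assign off off + STRIDE
    %endrep
        ret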

Note: Depending on which specific CPU it is, it may be faster to have a loop rather than completely unrolling (there's a compromise between instruction fetch and decode costs and loop overhead). For more recent Intel CPUs there's a thing called a loop stream detector designed to avoid fetch and decode for loops smaller than a certain size (where that size depends on CPU model); and I'd assume generating a loop that fits within that size would be optimal.

About the math behind this: at the beginning of the unrolled loop, if ecx < STRIDE, and n = (SIZE div STRIDE), and SIZE is not divisible by STRIDE, then (n-1)*STRIDE < SIZE, i.e. n-1 further iterations are safe without masking (with SIZE = 128 KiB and STRIDE = 4099, n = 31). The n-th further iteration may or may not need masking (it depends on the initial ecx). If it did not need the mask, the one after it will.

The consequence is that you can design code like this:

    xor    ecx, ecx
    jmp    loop_entry
unrolled_loop:
    and    ecx, MASK     ; make ecx < STRIDE again
    jz     terminate_loop
loop_entry:
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE
    ... (SIZE div STRIDE)-1 times
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE

    ;after (SIZE div STRIDE)-th add ecx,STRIDE
    cmp    ecx, SIZE
    jae    unrolled_loop
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE
    ; assert( ecx >= SIZE )
    jmp    unrolled_loop

terminate_loop:

The number of adds that happen before an and is needed is not regular; it will be n or n+1. That is why the end of the unrolled loop has to be doubled, so that each pass of the unrolled loop starts with ecx < STRIDE.

I'm not good enough with NASM macros to decide whether this can be generated by some kind of macro magic.
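For what it's worth, the body above does look expressible with a plain %rep. A sketch (untested), reusing the question's STRIDE/SIZE definitions:

%define STRIDE 4099
%define SIZE   (128 * 1024)
%define MASK   (SIZE - 1)
%define N      (SIZE / STRIDE)   ; 31 safe load+add pairs per pass

    xor    ecx, ecx
    jmp    loop_entry
unrolled_loop:
    and    ecx, MASK             ; make ecx < STRIDE again
    jz     terminate_loop
loop_entry:
%rep N                           ; N pairs are always in range while ecx < STRIDE
    movzx  eax, BYTE [rsi + rcx]
    add    ecx, STRIDE
%endrep
    cmp    ecx, SIZE             ; after the N-th add we may or may not have wrapped
    jae    unrolled_loop
    movzx  eax, BYTE [rsi + rcx] ; one extra pair when the N-th add did not wrap
    add    ecx, STRIDE
    jmp    unrolled_loop
terminate_loop: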

Also, there's a question of whether this can be macro-ed to use different registers, like:

    xor    ecx, ecx

    ...
loop_entry:
    lea    rdx,[rcx + STRIDE*4]  
    movzx  eax, BYTE [rsi + rcx]
    movzx  eax, BYTE [rsi + rcx + STRIDE]
    movzx  eax, BYTE [rsi + rcx + STRIDE*2]
    movzx  eax, BYTE [rsi + rcx + STRIDE*3]
    add    ecx, STRIDE*8
    movzx  eax, BYTE [rsi + rdx]
    movzx  eax, BYTE [rsi + rdx + STRIDE]
    movzx  eax, BYTE [rsi + rdx + STRIDE*2]
    movzx  eax, BYTE [rsi + rdx + STRIDE*3]
    add    edx, STRIDE*8
    ...

    ; then the final part can be filled with simple pairs like
    movzx  eax, BYTE [rsi + rcx]
    add    ecx, STRIDE
    ; ... until the n-th add is reached, then the doubled loop exit follows:

    ;after (SIZE div STRIDE)-th add ecx,STRIDE
    cmp    ecx, SIZE
    jae    unrolled_loop
    movzx  eax, BYTE [rsi + rcx]
    add    ecx, STRIDE
    ; assert( ecx >= SIZE )
    jmp    unrolled_loop

Also, the inner "safe" part can itself be looped some number of times. For example, with SIZE / STRIDE = 31.97657965357404 in your case (so n = 31), the inner block of 8 movzx instructions can be looped 3 times (3*8 = 24 adds), then 7 of the simple non-and pairs bring it to the 31st add, and the doubled loop exit follows to eventually reach the 32nd add as needed.

Although with your ~31.9 that looks pointless; looping over the middle part would make sense when SIZE div STRIDE is in the hundreds or more.

If you use AVX2 gather to generate the load uops, you can use SIMD for the add + AND. This probably isn't what you want when trying to measure anything about non-gather loads, though!
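For illustration only (this is my sketch, not anything from the question, and gathered dword loads are not the same thing as ordinary scalar loads), the gather variant could look roughly like this:

%define STRIDE 4099
%define SIZE   (128 * 1024)

section .rodata
align 32
init_idx: dd 0*STRIDE, 1*STRIDE, 2*STRIDE, 3*STRIDE
          dd 4*STRIDE, 5*STRIDE, 6*STRIDE, 7*STRIDE
stride8:  dd 8*STRIDE
idxmask:  dd SIZE - 1

section .text
; rdi = iteration count, rsi = buffer, as in the question.
; Note: dword gathers can read up to 3 bytes past SIZE, so leave a little slack.
gather_bench:
    vmovdqu      ymm1, [rel init_idx]     ; 8 byte-offsets
    vpbroadcastd ymm2, [rel stride8]      ; 8*STRIDE in every lane
    vpbroadcastd ymm3, [rel idxmask]      ; SIZE-1 in every lane
.top:
    vpcmpeqd   ymm4, ymm4, ymm4           ; all-ones mask, consumed by the gather
    vpgatherdd ymm0, [rsi + ymm1*1], ymm4 ; gather 8 dwords through the 8 offsets
    vpaddd     ymm1, ymm1, ymm2           ; offsets += 8*STRIDE ...
    vpand      ymm1, ymm1, ymm3           ; ... and wrap, both done in SIMD
    dec        rdi
    jnz        .top
    vzeroupper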

If your region size is 2^16..2^19 bytes, you can use add ax, dx (with dx = stride, to avoid LCP stalls from an imm16) to get wrapping at 2^16 for free. Use eax as a scaled index. With lea di, [eax + STRIDE * n] and so on in an unrolled loop, this could save enough uops to let you run 2 loads per clock without bottlenecking on the front-end. But the partial-register merging dependencies (on Skylake) would create multiple loop-carried dep chains, and you'd run out of registers in 32-bit mode if you need to avoid reusing them.
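A minimal sketch of that idea for a plain 2^16-byte region (my reading of it, untested; the 2^17..2^19 sizes would use eax as a scaled index as described):

%define STRIDE 4099
; rdi = iteration count, rsi = 64 KiB buffer
    xor   eax, eax      ; offset lives in ax; the upper bits of rax stay zero
    mov   edx, STRIDE   ; stride in dx, so add ax, dx has no imm16 and no LCP stall
.top:
    movzx ecx, byte [rsi + rax]
    add   ax, dx        ; wraps at 2^16 for free
    movzx ecx, byte [rsi + rax]
    add   ax, dx        ; each read of rax pays the partial-register merge mentioned above
    dec   rdi
    jnz   .top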

You could even consider mapping the low 64k of virtual memory (on Linux set vm.mmap_min_addr=0 ) and using 16-bit addressing modes in 32-bit code. Reading only 16-bit registers avoids complications of having to only write 16 bits; it's fine to end up with garbage in the upper 16.


To do better without 16-bit addressing modes, you need to create conditions where you know wrapping can't happen. This allows unrolling with [reg + STRIDE * n] addressing modes.

You could write a normal unrolled loop that breaks out when approaching the wrap-around point (i.e. when ecx + STRIDE*n > bufsize), but that won't predict well if bufsize / STRIDE is greater than about 22 on Skylake.
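For illustration (my sketch, untested), the break-out structure could look like this: a fast path of four unmasked loads, plus a rarely taken path that masks each offset when the next group could reach SIZE. The cmp/jae at the top is the branch whose taken/not-taken pattern stops predicting well once bufsize / STRIDE grows past that point:

%define STRIDE 4099
%define SIZE   (128 * 1024)
%define MASK   (SIZE - 1)

    xor   ecx, ecx
.top:
    cmp   ecx, SIZE - STRIDE*3            ; could one of the next 4 loads reach SIZE?
    jae   .near_wrap
    movzx eax, byte [rsi + rcx]           ; fast path: no masking at all
    movzx eax, byte [rsi + rcx + STRIDE]
    movzx eax, byte [rsi + rcx + STRIDE*2]
    movzx eax, byte [rsi + rcx + STRIDE*3]
    add   ecx, STRIDE*4
    dec   rdi
    jnz   .top
    jmp   .done
.near_wrap:                               ; rare path: mask each offset for this one group
%assign k 0
%rep 4
    lea   edx, [ecx + STRIDE*k]
    and   edx, MASK
    movzx eax, byte [rsi + rdx]
%assign k k+1
%endrep
    add   ecx, STRIDE*4
    and   ecx, MASK
    dec   rdi
    jnz   .top
.done: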

You could simply do the AND masking only once per iteration, and relax the constraint that the working set is exactly 2^n bytes: i.e. if you reserve enough space for your loads to go beyond the end by up to STRIDE * n - 1 bytes, and you're OK with that cache behaviour, then just do it.
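A sketch of that version (again mine, untested): one and per group of four loads, with the buffer allocated STRIDE*3 bytes larger than SIZE so the unmasked displacements can overshoot harmlessly. That is 7 fused-domain uops per 4 loads, which leaves the 4-wide front-end enough room for 2 loads per cycle:

%define STRIDE 4099
%define SIZE   (128 * 1024)
%define MASK   (SIZE - 1)
; rsi must point at an allocation of at least SIZE + STRIDE*3 bytes

    xor   ecx, ecx
.top:
    and   ecx, MASK                        ; one mask per group of 4 loads
    movzx eax, byte [rsi + rcx]
    movzx eax, byte [rsi + rcx + STRIDE]   ; these may read past SIZE, into the slack
    movzx eax, byte [rsi + rcx + STRIDE*2]
    movzx eax, byte [rsi + rcx + STRIDE*3]
    add   ecx, STRIDE*4
    dec   rdi
    jnz   .top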

If you choose your unroll factor carefully, you can maybe control where the wraparound will happen every time. But with a prime stride and a power of 2 buffer, I think you'd need an unroll of lcm(stride, bufsize/stride) = stride * bufsize/stride = bufsize for the pattern to repeat. For buffer sizes that don't fit in L1, this unroll factor is too large to fit in the uop cache, or even L1I. I looked at a couple small test cases, like n*7 % 16 , which repeats after 16 iterations, just like n*5 % 16 and n*3 % 16 . And n*7 % 32 repeats after 32 iterations. ie the linear congruential generator explores every value less than the modulus when the multiplier and modulus are relatively prime.

None of these options are ideal, but that's the best I can suggest for now.
