
How to optimize for writes to memory in hot loop

I have a loop in my code where I spend most of the CPU time:

%%write_to_intermediate_head:
    ; uint64_t count_index = (arr[i] >> (8 * cur_digit)) & (RADIX - 1);
    ; We will use rsi to hold the digit we're currently examining
    mov rsi, [rdx + rdi * 8] ; Load arr[i] into rsi
    mov r9, %1 ; Load current digit we're sorting on
    shl r9, 3  ; Multiply by 8
    shrx rsi, rsi, r9
    and rsi, 255

    ; intermediate_arr[--digit_counts[count_index]] = arr[i];
    ; rdi is the loop counter i
    ; rsi is the count_index
    ; rcx is the digit_counts array
    ; rax is the intermediate_arr
    dec qword [rcx + rsi * 8]
    mov r9, [rcx + rsi * 8] ; --digit_counts[count_index]

    mov rsi, [rdx + rdi * 8] ; arr[i]
    mov [rax + r9 * 8], rsi

    dec rdi
    jnz %%write_to_intermediate_head

The variables digit_counts, arr, and intermediate_arr are all in memory (heap and bss). The AMD profiler shows that many cycles are spent reading and writing to these memory locations. Is there any way to optimize this?

Do your counts truly need to be qwords, or could you use a narrower type to cut the cache footprint of digit_counts in half with 32-bit counts (or even smaller with narrower types)? If you're getting cache misses, that means much more time spent waiting for loads/stores whenever OoO exec can't hide that latency.
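For example, a minimal sketch with digit_counts declared as 256 dwords instead of qwords (this assumes your element count fits in 32 bits):

    dec   dword [rcx + rsi * 4]    ; --digit_counts[count_index], now 1 KiB total
    mov   r9d, [rcx + rsi * 4]     ; 32-bit load zero-extends into all of R9
    ; R9 still works as a full 64-bit index in [rax + r9 * 8]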

I guess copying the data around is going to account for most of the memory bandwidth / cache misses, though. This looks like Radix Sort, and the amount of metadata to manage is small compared to the data. (But having at least the metadata hit in cache can help, making it more resistant to eviction by all the other data you're throwing around.)


No matter what you do, the access pattern of Radix Sort is inherently not very cache-friendly, although it's not terrible. You're scattering writes across 256 possible destinations, along with updating the pointers. But those 256 destinations are sequential streams, so they can hit in L1d cache if you're lucky.

Hopefully those destinations aren't multiples of 4k apart (initially or most of the time), otherwise they'll alias the same line in L1d cache and cause conflict misses. (i.e. force eviction of another partially-written cache line that you're soon going to write to.)


You have some redundant loads / stores which may be a bottleneck for load/store execution units, but if that's not the bottleneck then cache will absorb them just fine. This section is mostly about tuning the loop to use fewer uops, improving things in the no-cache-miss best case, and giving OoO exec less latency to hide.

Using a memory-destination dec and then reloading the dec's store seems obviously inefficient in terms of total back-end load/store operations, and latency for OoO exec to hide. (Although on AMD, dec mem is still a single uop for the front-end, vs. 3 on Intel; https://uops.info/ and https://agner.org/optimize/ ).
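Something like this does the same count update with one load and one store (same register assignments as your loop):

    mov   r9, [rcx + rsi * 8]    ; load digit_counts[count_index]
    dec   r9                     ; decrement in a register
    mov   [rcx + rsi * 8], r9    ; store back; R9 already holds the new value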

Similarly, aren't you loading [rdx + rdi * 8] (arr[i]) twice, with the same RDI? SHRX can copy-and-shift, so you wouldn't even be saving uops by keeping that load result around for later. (You could also use a simple non-indexed addressing mode for arr[i], by doing a pointer-increment like add rdi, 8 and cmp rdi, endp / jne, where endp is something you calculated before the loop with lea endp, [rdx + size*8]. Looping forward over an array can be better for some HW prefetchers.)
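An untested sketch of that skeleton, with R10 as the end pointer and R12 assumed to hold the element count. Note that walking forward means digit_counts would need to hold bucket start offsets that you post-increment, instead of end offsets that you pre-decrement, to keep the pass stable:

    lea   r10, [rdx + r12 * 8]   ; endp = arr + size, hoisted out of the loop
%%head:
    mov   r8, [rdx]              ; arr[i]: simple addressing mode, loaded once
    shrx  rsi, r8, r9            ; copy-and-shift; R8 still holds arr[i]
    ; ... extract the low byte of RSI, update the count, store R8 ...
    add   rdx, 8
    cmp   rdx, r10
    jne   %%head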

x86-64 has 15 registers, so if you need more for this inner loop, save/restore some call-preserved registers (like RBX or RBP) at the top/bottom of the function. Or spill some outer-loop stuff to memory if necessary.
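e.g. a standard save/restore around the whole function:

    push  rbx                ; call-preserved, so save before use
    push  rbp
    ; ... function body: RBX and RBP are now free for the hot loop ...
    pop   rbp                ; restore in reverse order
    pop   rbx
    ret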

mov r9, %1 looks loop-invariant, so hoist that and the shl r9, 3 calculation out of the loop, and don't overwrite R9 inside the inner loop.


You do need to zero-extend the byte you extracted, but and rsi, 255 is not as efficient as movzx eax, sil. (Or better, pick a register like ECX whose low byte can be accessed without a REX prefix.) AMD can't do mov-elimination on movzx the way Intel can, though, so this only saves code size on AMD, but it optimizes latency if you ever run this on Intel CPUs.
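For comparison, all three of these zero-extend the digit byte; only the encodings differ:

    and   rsi, 255       ; 7 bytes: REX.W + opcode + ModRM + imm32
    movzx ecx, sil       ; 4 bytes: REX prefix needed just to address SIL
    movzx ecx, cl        ; 3 bytes: no REX when the byte is already in a legacy low reg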

Or better, AMD Zen has single-uop BMI1 bextr r64,r64,r64, so prepare a start/length pair in the low 2 bytes of a register. As discussed, that's loop-invariant. i.e. before the loop: mov ecx, %k1 / shl cl, 3 / mov ch, 0x8. (AMD doesn't have partial-register stalls, just false dependencies. In this case it's a true dependency, since we want to merge.) If that's inline asm syntax, %k1 specifies the 32-bit version of the register. Or if it's memory, you're just loading anyway, and hoisting the load will save another load!
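i.e. something like this, assuming you move digit_counts out of RCX so ECX is free for the control word, and that R8 holds arr[i] inside the loop:

    mov   ecx, %1        ; cur_digit, loop-invariant
    shl   cl, 3          ; CL = start bit index = 8 * cur_digit
    mov   ch, 8          ; CH = field length in bits
; then inside the loop, single-uop on Zen:
    bextr rsi, r8, rcx   ; rsi = (arr[i] >> (8*cur_digit)) & 0xFF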

(Intel has 2-uop bextr, presumably shift and bzhi uops.)

Or if you really want to load twice, movzx esi, byte [rdi + rdx] where RDI is a pointer to arr[i] that you increment or decrement, and RDX is a byte offset. But BEXTR is probably a better bet.
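Putting those pieces together, an untested sketch of the whole inner loop, keeping your backward iteration but with digit_counts moved from RCX to R11 so RCX can hold the BEXTR control word set up above:

%%write_to_intermediate_head:
    mov   r8, [rdx + rdi * 8]    ; arr[i], loaded exactly once
    bextr rsi, r8, rcx           ; count_index = (arr[i] >> (8*cur_digit)) & 255
    mov   r9, [r11 + rsi * 8]    ; digit_counts[count_index]
    dec   r9
    mov   [r11 + rsi * 8], r9    ; one load + one store for the count update
    mov   [rax + r9 * 8], r8     ; intermediate_arr[count] = arr[i], no reload
    dec   rdi
    jnz   %%write_to_intermediate_head

That's 8 instructions per element instead of 11, with no repeated loads.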

Other optimizations: the initial pass to generate counts can be done for all digits at the same time, using a matrix instead of an array. For 64-bit unsigned integers, doing 8 passes with 1-byte digits is close enough to ideal, as the counts / indexes will fit in the L1 cache. The initial pass stores counts in a [8][256] matrix, and 32-bit counts / indexes should be good enough.
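An untested sketch of that single counting pass, with my own register choices (RDX = arr, RDI = element count, R8 = pointer to a zeroed [8][256] dword matrix):

%%count_elem:
    mov   rax, [rdx]              ; v = arr[i]
    mov   rcx, r8                 ; point at row 0: counts for digit 0
    mov   r9d, 8                  ; 8 byte-sized digit positions
%%count_digit:
    movzx esi, al                 ; current digit = low byte of v
    inc   dword [rcx + rsi * 4]   ; ++counts[d][digit], 32-bit counts
    add   rcx, 256 * 4            ; advance to the row for digit d+1
    shr   rax, 8                  ; bring the next digit down into AL
    dec   r9d
    jnz   %%count_digit
    add   rdx, 8
    dec   rdi
    jnz   %%count_elem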

For an array much larger than cache, if the data to be sorted is reasonably uniform, the first radix sort pass can be a most-significant-digit pass, creating 256 bins if using 1-byte digits, with the goal that each of the 256 bins fits in cache; then do least-significant-digit-first radix sort on each of the 256 bins, one bin at a time. If the array is larger still, the first pass can create more bins: 512 (9-bit digit), 1024 (10-bit digit), and so on; each bin can then still be sorted using 1-byte digits, with a smaller digit on the last pass.
