Can 128bit/64bit hardware unsigned division be faster in some cases than 64bit/32bit division on x86-64 Intel/AMD CPUs?

Question

Can a scaled 64bit/32bit division performed by the hardware 128bit/64bit division instruction, such as:

; Entry arguments: Dividend in EAX, Divisor in EBX
shl rax, 32  ;Scale up the Dividend by 2^32
xor rdx,rdx
and rbx, 0xFFFFFFFF  ;Clear any garbage that might have been in the upper half of RBX
div rbx  ; RAX = RDX:RAX / RBX

...be faster in some special cases than the scaled 64bit/32bit division performed by the hardware 64bit/32bit division instruction, such as:

; Entry arguments: Dividend in EAX, Divisor in EBX
mov edx,eax  ;Scale up the Dividend by 2^32
xor eax,eax
div ebx  ; EAX = EDX:EAX / EBX

By "some special cases" I mean unusual dividends and divisors. I am interested in comparing the div instruction only.

Answer 1

You're asking about optimizing uint64_t / uint64_t C division to a 64b / 32b => 32b x86 asm division, when the divisor is known to be 32-bit. The compiler must of course avoid the possibility of a #DE exception on a perfectly valid (in C) 64-bit division, otherwise it wouldn't have followed the as-if rule. So it can only do this if it's provable that the quotient will fit in 32 bits.

Yes, that's a win or at least break-even. On some CPUs it's even worth checking for the possibility at runtime, because 64-bit division is so much slower. But unfortunately current x86 compilers don't have an optimizer pass to look for this optimization, even when you do manage to give them enough info that they could prove it's safe. For example, if (edx >= ebx) __builtin_unreachable(); didn't help the last time I tried.
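The runtime check mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name div_u64_by_u32 is hypothetical, the inline asm uses GNU C extended-asm syntax (gcc/clang on x86 only), and other targets simply fall back to the plain C division. The quotient fits in 32 bits exactly when the high half of the dividend is smaller than the divisor, so that is the condition tested before using the narrow div.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: dividend / divisor, using the 64b / 32b => 32b form
 * of div only when the quotient is known to fit in 32 bits. */
uint64_t div_u64_by_u32(uint64_t dividend, uint32_t divisor)
{
    assert(divisor != 0);
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    uint32_t hi = (uint32_t)(dividend >> 32);
    uint32_t lo = (uint32_t)dividend;
    if (hi < divisor) {            /* quotient < 2^32, so div r32 cannot #DE */
        uint32_t quot, rem;
        __asm__("divl %[d]"        /* EDX:EAX / divisor -> EAX, remainder EDX */
                : "=a"(quot), "=d"(rem)
                : "a"(lo), "d"(hi), [d] "rm"(divisor)
                : "cc");
        (void)rem;                 /* remainder is not needed here */
        return quot;
    }
#endif
    /* General case: let the compiler emit whatever wide division it wants. */
    return dividend / divisor;
}
```

Whether the extra compare-and-branch pays off depends on the CPU and on how often the narrow path is taken, so it would need measuring on the target hardware.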

For the same inputs, 32-bit operand-size will always be at least as fast

16 or 8-bit could maybe be slower than 32 because they may have a false dependency when writing their output, but writing a 32-bit register zero-extends to 64 to avoid that. (That's why mov ecx, ebx is a good way to zero-extend ebx to 64-bit, better than an and with a value that's not encodeable as a 32-bit sign-extended immediate, as harold pointed out.) But other than partial-register shenanigans, 16-bit and 8-bit division are generally as fast as 32-bit, or at least not worse.

On AMD CPUs, division performance doesn't depend on operand-size, just the data. 0 / 1 with 128/64-bit should be faster than the worst case of any smaller operand-size. AMD's integer-division instruction is only 2 uops (presumably because it has to write 2 registers), with all the logic done in the execution unit.

16-bit / 8-bit => 8-bit division on Ryzen is a single uop (because it only has to write AH:AL = AX).

On Intel CPUs, div / idiv is microcoded as many uops. The uop count is about the same for all operand-sizes up to 32-bit (Skylake = 10), but 64-bit is much, much slower. (Skylake div r64 is 36 uops, Skylake idiv r64 is 57 uops.) See Agner Fog's instruction tables: https://agner.org/optimize/

div/idiv throughput for operand-sizes up to 32-bit is fixed at 1 per 6 cycles on Skylake. But div/idiv r64 throughput is one per 24-90 cycles.

See also "Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux" for a specific performance experiment where modifying the REX.W prefix in an existing binary to change div r64 into div r32 made a factor of ~3 difference in throughput.

And "Why does Clang do this optimization trick only from Sandy Bridge onward?" shows clang opportunistically using 32-bit division when the dividend is small, when tuning for Intel CPUs. But you have a large dividend and a large-enough divisor, which is a more complex case. That clang optimization still zeroes the upper half of the dividend in asm, never using a non-zero or non-sign-extended EDX.

"I have failed to make the popular C compilers generate the latter code when dividing an unsigned 32-bit integer (shifted left 32 bits) by another 32-bit integer."

I'm assuming you cast that 32-bit integer to uint64_t first, to avoid UB and get a normal uint64_t / uint64_t in the C abstract machine.
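For reference, the C-level form of the question's scaled division might look like this minimal sketch. The name scaled_div is hypothetical; the point is only that the cast happens before the shift, since shifting a 32-bit value left by 32 would be undefined behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: computes (a * 2^32) / b in the C abstract machine. */
uint64_t scaled_div(uint32_t a, uint32_t b)
{
    assert(b != 0);
    /* Widen first: a << 32 on a 32-bit type would be undefined behaviour. */
    uint64_t dividend = (uint64_t)a << 32;
    /* b is converted to uint64_t here, so this is uint64_t / uint64_t. */
    return dividend / b;
}
```

Note that the result only fits in 32 bits when a < b, which is why a compiler cannot blindly narrow this to div r32.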

That makes sense: your way wouldn't be safe, because it will fault with #DE when edx >= ebx. x86 division faults when the quotient overflows AL / AX / EAX / RAX, instead of silently truncating. There's no way to disable that.
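The fault condition can be modelled in portable C. The sketch below is an illustration only, with hypothetical names (div32_would_fault, div32_model); it does not execute div itself, it just mirrors the rule that EDX:EAX / divisor overflows a 32-bit quotient exactly when the high half is greater than or equal to the divisor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate: would `div r32` raise #DE for hi:lo / divisor?
 * The low half never matters: the quotient needs more than 32 bits
 * exactly when hi >= divisor (and a zero divisor always faults). */
bool div32_would_fault(uint32_t hi, uint32_t divisor)
{
    return divisor == 0 || hi >= divisor;
}

/* Hypothetical model of the quotient div r32 produces when it doesn't fault. */
uint32_t div32_model(uint32_t hi, uint32_t lo, uint32_t divisor)
{
    assert(!div32_would_fault(hi, divisor));
    uint64_t dividend = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(dividend / divisor);   /* guaranteed to fit in 32 bits */
}
```

For the question's code, hi is the original EAX value, so the narrow version is only safe when the original dividend is smaller than the divisor.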

So compilers normally only use idiv after cdq or cqo, and div only after zeroing the high half, unless you use an intrinsic or inline asm to open yourself up to the possibility of your code faulting. In C, x / y only faults if y == 0 (or, for signed types, INT_MIN / -1 is also allowed to fault ¹).

GNU C doesn't have an intrinsic for wide division, but MSVC has _udiv64. (With gcc/clang, division wider than 1 register uses a helper function which does try to optimize for small inputs. But this doesn't help for 64/32 division on a 64-bit machine, where GCC and clang just use the 128/64-bit division instruction.)
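A rough sketch of using the MSVC intrinsic, with a portable fallback so the same source builds elsewhere. Assumptions to flag: the wrapper name narrow_udiv is hypothetical; the signature used for _udiv64 (dividend, divisor, pointer to remainder, declared in immintrin.h) is as I understand it from Microsoft's documentation, and it is only available in Visual Studio 2019 and later, hence the version guard.

```c
#include <assert.h>
#include <stdint.h>
#if defined(_MSC_VER) && !defined(__clang__) && _MSC_VER >= 1920 && \
    (defined(_M_X64) || defined(_M_IX86))
#include <immintrin.h>
#define HAVE_UDIV64 1
#endif

/* Hypothetical wrapper: 64b / 32b => 32b unsigned division.
 * The caller promises the quotient fits in 32 bits; with the intrinsic,
 * breaking that promise raises #DE instead of truncating. */
uint32_t narrow_udiv(uint64_t dividend, uint32_t divisor)
{
    assert(divisor != 0 && (dividend >> 32) < divisor);
#ifdef HAVE_UDIV64
    unsigned int rem;                       /* remainder, unused here */
    return _udiv64(dividend, divisor, &rem);
#else
    /* Fallback: gcc/clang will typically emit a 64-bit-operand div here. */
    return (uint32_t)(dividend / divisor);
#endif
}
```

The assert documents the precondition in debug builds; in release builds the caller alone is responsible for it.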

Even if there were some way to promise the compiler that your divisor would be big enough to make the quotient fit in 32 bits, current gcc and clang don't look for that optimization in my experience. It would be a useful optimization for your case (if it's always safe), but compilers won't look for it.

Footnote 1: To be more specific, ISO C describes those cases as "undefined behaviour"; some ISAs like ARM have non-faulting division instructions. C UB means anything can happen, including just truncation to 0 or some other integer result. See "Why does integer division by -1 (negative one) result in FPE?" for an example of AArch64 vs. x86 code-gen and results. Allowed to fault doesn't mean required to fault.

Answer 2

"Can 128bit/64bit hardware unsigned division be faster in some cases than 64bit/32bit division on x86-64 Intel/AMD CPUs?"

In theory, anything is possible (e.g. maybe in 50 years' time Nvidia creates an 80x86 CPU that ...).

However, I can't think of a single plausible reason why a 128bit/64bit division would ever be faster than (not merely equivalent to) a 64bit/32bit division on x86-64.

"I suspect this because I assume that the C compiler authors are very smart and so far I have failed to make the popular C compilers generate the latter code when dividing an unsigned 32-bit integer (shifted left 32 bits) by another 32-bit integer. It always compiles to the 128bit/64bit div instruction. P.S. The left shift compiles fine to shl."

Compiler developers are smart, but compilers are complex and the C language rules get in the way. For example, if you just do a = b/c; (with b being 64-bit and c being 32-bit), the language's rules are that c gets promoted to 64-bit before the division happens. It therefore ends up being a 64-bit divisor in some kind of intermediate language, and that makes it hard for the back-end translation (from intermediate language to assembly language) to tell that the 64-bit divisor could be a 32-bit divisor.
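The promotion can be observed directly in C. In this small sketch (the function name div_mixed is hypothetical, and _Static_assert requires C11), the compile-time check passes because the 32-bit operand is converted to the 64-bit type before the division is performed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example of a = b / c with a 64-bit b and a 32-bit c. */
uint64_t div_mixed(uint64_t b, uint32_t c)
{
    /* The usual arithmetic conversions make b / c a 64-bit expression. */
    _Static_assert(sizeof(b / c) == sizeof(uint64_t),
                   "c is converted to uint64_t before dividing");
    assert(c != 0);
    return b / c;   /* seen by the optimizer as a 64-bit / 64-bit division */
}
```

By the time the back end sees this division, the information that c was originally 32 bits wide has to be recovered through value-range analysis rather than read off the operand types.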

2 answers

Answer 1: 5, accepted, 2019-06-18 21:12:25

Answer 2: 2, 2019-06-18 20:30:46
