Why does a static_cast speed up the un-optimized build of my integer division function?

Question

... or rather, why does omitting the static_cast slow down my function?

Consider the function below, which performs integer division:

int Divide(int x, int y) {
  int ret = 0, i = 32;
  long j = static_cast<long>(y) << i;
  while (x >= y) {
    while (x < j) --i, j >>= 1;
    ret += 1 << i, x -= j;
  }
  return ret;
}

This performs reasonably well, as one might expect. However, if we remove the static_cast on line 3, like so:

int Divide(int x, int y) {
  int ret = 0, i = 32;
  long j = y << i;
  while (x >= y) {
    while (x < j) --i, j >>= 1;
    ret += 1 << i, x -= j;
  }
  return ret;
}

This version performs noticeably slower, sometimes several hundred times slower (I haven't measured rigorously, but that shouldn't be far off), for pathological inputs where x is large and y is small. I was curious about why and tried digging into the assembly code. However, apart from the casting differences on line 3, I get the exact same output. Here's the line 3 output for reference (source):

With static_cast:

movsxd  rax, dword ptr [rbp - 8]
mov     ecx, dword ptr [rbp - 16]
shl     rax, cl
mov     qword ptr [rbp - 24], rax

Without static_cast:

mov     eax, dword ptr [rbp - 8]
mov     ecx, dword ptr [rbp - 16]
shl     eax, cl
cdqe
mov     qword ptr [rbp - 24], rax

The rest is identical.

I'm really curious where the overhead is occurring.

EDIT: I've tested a bit further, and it looks like most of the time is spent in the while loop, not in the initialization of j. The additional cdqe instruction doesn't seem significant enough to account for the total increase in wall time.

Some disclaimers, since I've been getting a lot of comments peripheral to the actual question:

I'm aware that shifting an int by 32 or more bits is UB.
I'm assuming only positive inputs.
long is 8 bytes long on my platform, so it doesn't overflow.

I'd like to know what might be causing the increased runtime, which the comments criticizing the above don't actually address.

Answer 1

Widening after the shift reduces your loop to naive repeated subtraction

It's not the run-time of cdqe or movsxd vs. mov that's relevant. It's the different starting values for your loop, which result in a different iteration count, especially for pathological cases.

Clang without optimization compiled your source exactly the way it was written, doing the shift on an int and then sign-extending the result to long. The shift-count UB is invisible to the compiler with optimization disabled because, for consistent debugging, it assumes variable values can change between statements, so the behaviour depends on what the target machine does with a shift count equal to the operand size.

When compiling for x86-64, that results in long j = (long)(y<<0);, i.e. long j = y;, rather than having those bits at the top of a 64-bit value.

x86 scalar shifts like shl eax, cl mask the count with &31 (or with &63 for 64-bit operand size), so the shift used a count of 32 % 32 == 0. Other ISAs differ: 32-bit ARM uses the low byte of the count register, so a count of 32 shifts out all the bits, while AArch64 takes the count modulo the operand size, like x86 does.
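To make the count masking concrete, here is a minimal sketch that emulates the x86 behaviour in well-defined C++ using unsigned types. The helper names X86Shl32 and X86Shl64 are hypothetical, chosen only for illustration, and the sketch assumes a typical platform where int is 32 bits wide.

```cpp
#include <cstdint>

// Hypothetical helper: emulates x86 `shl r32, cl`, which uses only the low
// 5 bits of the count. Unsigned arithmetic keeps the shift well-defined.
std::uint32_t X86Shl32(std::uint32_t value, unsigned count) {
  return value << (count & 31u);
}

// Hypothetical helper: emulates x86 `shl r64, cl`, which uses only the low
// 6 bits of the count.
std::uint64_t X86Shl64(std::uint64_t value, unsigned count) {
  return value << (count & 63u);
}
```

With these, X86Shl32(y, 32) returns y unchanged, which models the version without the cast, while X86Shl64(y, 32) moves y into the upper half of the 64-bit value, which models the version with the cast.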

Notice that it does a 32-bit operand-size shl eax, cl and then sign-extends the result with cdqe, instead of doing a sign-extending reload of y and then a 64-bit operand-size shl rax, cl.

Your loop has a data-dependent iteration count

If you single-step with a debugger, you can see the local variable values accurately. (That's the main benefit of an un-optimized debug build, which is not what you should be benchmarking.) And you can count iterations.

  while (x >= y) {
    while (x < j) --i, j >>= 1;
    ret += 1 << i, x -= j;
  }

With j = y, if we enter the outer loop at all, then the inner loop condition is always false.
So it never runs, j stays constant the whole time, and i stays constant at 32.

1<<32 again compiles to a variable-count shift with 32-bit operand-size, because 1 has type int. (1LL has type long long, and can safely be left-shifted by 32.) On x86-64, this is just a slow way to do ret += 1;.

x -= j; is of course just x -= y;, so we're counting how many subtractions it takes to make x < y.

It's well-known that division by repeated subtraction is extremely slow for large quotients, since the run time scales linearly with the quotient.
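The difference in iteration count can be checked with a small instrumented sketch. CountOuterIterations is a hypothetical helper, not part of the original code. It avoids the UB by modelling the two starting states directly: j = y << 32 computed in 64 bits, or j = y with an effective shift count of 0 (standing in for the masked count 32 & 31). It assumes x >= 0 and y > 0.

```cpp
#include <cstdint>

// Hypothetical instrumented variant of Divide. Returns the number of
// outer-loop iterations and stores the quotient in *quotient.
// widen_before_shift == true  models static_cast<long>(y) << 32.
// widen_before_shift == false models what the x86-64 debug build computed:
// j = y, with 1 << i behaving like 1 << 0.
long CountOuterIterations(int x, int y, bool widen_before_shift,
                          int* quotient) {
  int ret = 0;
  int i = widen_before_shift ? 32 : 0;
  std::int64_t j =
      widen_before_shift ? (static_cast<std::int64_t>(y) << 32) : y;
  long iterations = 0;
  while (x >= y) {
    while (x < j) --i, j >>= 1;  // never runs when j starts at y
    ret += 1 << i;
    x -= static_cast<int>(j);  // j <= x here, so it fits in int
    ++iterations;
  }
  *quotient = ret;
  return iterations;
}
```

For x = 100 and y = 7, both variants produce the quotient 14, but the widened start takes 3 outer iterations (one per set bit of the quotient) while the other takes 14 (one per unit of the quotient).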

You do happen to get the right result, though. Yay.

BTW, long is only 32-bit on some targets like Windows x64 and 32-bit platforms; use long long or int64_t if you want a type twice the width of int. And maybe static_assert to make sure int isn't that wide.
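A minimal sketch of that suggestion, assuming non-negative x and positive y as in the question. The name DividePortable and the exact form of the static_assert are illustrative choices, not taken from the original post.

```cpp
#include <cstdint>
#include <limits>

// Make sure int is narrow enough that y << 32 fits in a 64-bit type.
static_assert(std::numeric_limits<int>::digits <= 31,
              "int is too wide for the 64-bit intermediate used here");

// Hypothetical portable variant: widen to a fixed-width 64-bit type before
// shifting, so the result does not depend on the width of long.
// Assumes x >= 0 and y > 0.
int DividePortable(int x, int y) {
  int ret = 0, i = 32;
  std::int64_t j = static_cast<std::int64_t>(y) << i;
  while (x >= y) {
    while (x < j) --i, j >>= 1;
    // Here j <= x, so j fits in int and i is at most 30.
    ret += 1 << i, x -= static_cast<int>(j);
  }
  return ret;
}
```

Because the cast happens before the shift, the shift is done at 64-bit width on every target, including those where long is only 32 bits.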

With optimization enabled, I think the same things would still hold true: clang looks like it's compiling to similar asm, just without the store/reload. So it's effectively (de facto) defining the behaviour of 1<<32 by just compiling it to an x86 shift instruction.

But I didn't test; that's just from a quick look at the asm at https://godbolt.org/z/M33vqGj5P and noting things like mov r8d, 1; shl r8d, cl (32-bit operand-size); add eax, r8d.

Answer 2

I'm keeping this answer up for now as the comments are useful.

Answer 3

int Divide(int x, int y) {
  int ret = 0, i = 32;
  long j = y << i;

On most systems, the size of int is 32 bits or less. Left-shifting a signed integer by a number of bits equal to or greater than its width results in undefined behaviour. Don't do this. Since the program is broken, it's irrelevant whether it's slower or faster.

Sidenote: Left-shifting a signed 32-bit integer by 31 or fewer bits may also be undefined if that shift causes the leftmost bit to change due to arithmetic overflow.
