VS: unexpected optimization behavior with _BitScanReverse64 intrinsic

The following code works fine in debug mode, since _BitScanReverse64 is defined to return 0 if no bit is set. Citing MSDN: (The return value is) "Nonzero if Index was set, or 0 if no set bits were found."

If I compile this code in release mode it still works, but if I enable compiler optimizations such as /O1 or /O2, the index is not zero and the assert() fails.

#include <iostream>
#include <cassert>
#include <intrin.h>   // declares _BitScanReverse64

using namespace std;

int main()
{
  unsigned long index = 0;
  _BitScanReverse64(&index, 0x0ull);

  cout << index << endl;

  assert(index == 0);

  return 0;
}

Is this the intended behaviour? I am using Visual Studio Community 2015, Version 14.0.25431.01 Update 3. (I left cout in so that the variable index is not deleted during optimization.) Also, is there an efficient workaround, or should I just not use this compiler intrinsic directly?

AFAICT, the intrinsic leaves garbage in index when the input is zero, which is a weaker guarantee than the behaviour of the asm instruction. This is why it has a separate boolean return value and integer output operand.

Despite the index arg being taken by reference, the compiler treats it as output-only.


unsigned char _BitScanReverse64 (unsigned __int32* index, unsigned __int64 mask)

Intel's intrinsics guide documentation for the same intrinsic (signature above) seems clearer than the Microsoft docs you linked, and sheds some light on what the MS docs are trying to say. But on careful reading, they do both seem to say the same thing, and describe a thin wrapper around the bsr instruction.

Intel documents the BSR instruction as producing an "undefined value" when the input is 0, but setting ZF in that case. But AMD documents it as leaving the destination unchanged:

AMD's BSF entry in AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions

... If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. ...

On current Intel hardware, the actual behaviour matches AMD's documentation: it leaves the destination register unmodified when the src operand is 0. Perhaps this is why MS describes it as only setting Index when the input is non-zero (and the intrinsic's return value is non-zero).

On Intel (but maybe not AMD), this goes as far as not even truncating a 64-bit register to 32-bit. e.g. mov rax,-1 ; bsf eax, ecx (with zeroed ECX) leaves RAX=-1 (64-bit), not the 0x00000000ffffffff you'd get from xor eax, 0. But with non-zero ECX, bsf eax, ecx has the usual effect of zero-extending into RAX, leaving for example RAX=3.
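The destination-unchanged behaviour that AMD documents can be written down as a small software model, which may make the difference from the intrinsic's contract easier to see. This is only a sketch: `BsrModel` and `bsr64_model` are hypothetical names introduced here for illustration, and the model covers just the 64-bit operand size, not the 32-bit non-truncation quirk.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model type: the destination register value and ZF after
// executing `bsr dest, src` with 64-bit operands.
struct BsrModel {
    std::uint64_t dest;
    bool zf;
};

// Models the behaviour AMD documents (and current Intel hardware shows):
// a zero source sets ZF and leaves the old destination value in place.
inline BsrModel bsr64_model(std::uint64_t dest, std::uint64_t src)
{
    if (src == 0)
        return {dest, true};      // destination unchanged, ZF = 1

    std::uint64_t idx = 0;
    while (src >>= 1)             // count how far the top set bit is from bit 0
        ++idx;
    return {idx, false};          // index of the highest set bit, ZF = 0
}

// Small self-check of the two cases.
inline void bsr64_model_self_check()
{
    assert(bsr64_model(42, 0).dest == 42);   // old value survives
    assert(bsr64_model(42, 8).dest == 3);    // highest set bit of 8 is bit 3
}
```

The intrinsic promises less than this model: for a zero input it only promises a zero return value, so the old contents of the index variable cannot be relied on.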


IDK why Intel still hasn't documented it. Perhaps a really old x86 CPU (like original 386?) implements it differently? Intel and AMD frequently go above and beyond what's documented in the x86 manuals in order to not break existing widely-used code (e.g. Windows), which might be how this started.

At this point it seems unlikely that Intel will ever drop that output dependency and leave actual garbage, -1, or 32 in the destination for input=0, but the lack of documentation leaves that option open.

Skylake dropped the false dependency for lzcnt and tzcnt (and a later uarch dropped the false dep for popcnt) while still preserving the dependency for bsr / bsf. (See: Why does breaking the "output dependency" of LZCNT matter?)


Of course, since MSVC optimized away your index = 0 initialization, presumably it just uses whatever destination register it wants, not necessarily the register that held the previous value of the C variable. So even if you wanted to, I don't think you could take advantage of the dst-unmodified behaviour even though it's guaranteed on AMD.

So in C++ terms, the intrinsic has no input dependency on index. But in asm, the instruction does have an input dependency on the dst register, like an add dst, src instruction. This can cause unexpected performance issues if compilers aren't careful.

Unfortunately on Intel hardware, the popcnt / lzcnt / tzcnt asm instructions also have a false dependency on their destination, even though the result never depends on it. Compilers work around this now that it's known, though, so you don't have to worry about it when using intrinsics (unless you have a compiler more than a couple years old, since it was only recently discovered).


You need to check the intrinsic's return value to make sure index is valid, unless you know the input was non-zero. e.g.

if(_BitScanReverse64(&idx, input)) {
    // idx is valid.
    // (MS docs say "Index was set")
} else {
    // input was zero, idx holds garbage.
    // (MS docs don't say Index was even set)
    idx = -1;     // might make sense, one lower than the result for bsr(1)
}
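The check above can be wrapped in a small helper so that call sites never see the garbage value. This is a minimal sketch, not the only reasonable design: `highest_set_bit` is a hypothetical name, returning -1 for a zero input is an assumption about what the caller wants, and the non-MSVC branch is a plain loop added only so the sketch builds with other compilers.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>   // _BitScanReverse64
#endif

// Hypothetical helper: index of the highest set bit of x, or -1 if x == 0.
inline int highest_set_bit(std::uint64_t x)
{
#if defined(_MSC_VER) && defined(_M_X64)
    unsigned long idx;                  // output-only, so no initializer needed
    if (_BitScanReverse64(&idx, x))     // non-zero return: idx was set
        return static_cast<int>(idx);
    return -1;                          // zero input: idx holds garbage, ignore it
#else
    // Portable fallback (assumption: only here so the sketch compiles elsewhere).
    int idx = -1;
    while (x != 0) {
        x >>= 1;
        ++idx;
    }
    return idx;
#endif
}

// Small self-check of the zero and non-zero cases.
inline void highest_set_bit_self_check()
{
    assert(highest_set_bit(0) == -1);
    assert(highest_set_bit(1) == 0);
}
```

With this, the program from the question would call `highest_set_bit(0x0ull)` and compare against -1, instead of relying on the initial value of `index` surviving the call.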

If you want to avoid this extra check branch, you can use the lzcnt instruction via different intrinsics if you're targeting new enough hardware (e.g. Intel Haswell or AMD Bulldozer IIRC). It "works" even when the input is all-zero, and actually counts leading zeros instead of returning the index of the highest set bit.
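A branch-free variant along those lines might look like the sketch below. The helper names `leading_zeros64` and `highest_set_bit_lzcnt` are hypothetical. The MSVC branch uses the `__lzcnt64` intrinsic and assumes the target CPU really supports LZCNT: on CPUs without it, the same encoding executes as BSR and gives different results. The other branches are fallbacks added only so the sketch builds with other compilers.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>   // __lzcnt64
#endif

// Hypothetical helper: number of leading zero bits in x, 64 when x == 0.
inline int leading_zeros64(std::uint64_t x)
{
#if defined(_MSC_VER) && defined(_M_X64)
    // Assumption: the CPU supports LZCNT (otherwise this runs as BSR).
    return static_cast<int>(__lzcnt64(x));
#elif defined(__GNUC__)
    // __builtin_clzll is undefined for 0, so handle that case explicitly.
    return x ? __builtin_clzll(x) : 64;
#else
    int n = 0;
    for (std::uint64_t bit = 1ull << 63; bit != 0 && (x & bit) == 0; bit >>= 1)
        ++n;
    return n;
#endif
}

// Converts the leading-zero count back to a bit index.
// For x == 0 this yields 63 - 64 == -1, one lower than the result for x == 1.
inline int highest_set_bit_lzcnt(std::uint64_t x)
{
    return 63 - leading_zeros64(x);
}

// Small self-check of the zero and non-zero cases.
inline void lzcnt_self_check()
{
    assert(leading_zeros64(0) == 64);
    assert(highest_set_bit_lzcnt(1) == 0);
}
```

The subtraction is what turns "count of leading zeros" into "index of highest set bit", and it happens to map a zero input to -1 with no branch in the MSVC path.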
