Is there anything special about -1 (0xFFFFFFFF) regarding ADC?

In a research project of mine I'm writing C++ code. However, the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag-manipulating instructions, in particular to ADC, but this shouldn't be a problem provided the compiler is smart enough to use it. Consider:

constexpr unsigned X = 0;

unsigned f1(unsigned a, unsigned b) {
    b += a;
    unsigned c = b < a;
    return c + b + X;
}

Variable c is a workaround to get my hands on the carry flag and add it to b and X. It looks like I got lucky, and the generated code (g++ -O3, version 9.1) is this:

f1(unsigned int, unsigned int):
 add %edi,%esi
 mov %esi,%eax
 adc $0x0,%eax
 retq 

For all values of X that I've tested the code is as above (except, of course, for the immediate value $0x0, which changes accordingly). I found one exception though: when X == -1 (or 0xFFFFFFFFu or ~0u; it really doesn't matter how you spell it) the generated code is:

f1(unsigned int, unsigned int):
 xor %eax,%eax
 add %edi,%esi
 setb %al
 lea -0x1(%rsi,%rax,1),%eax
 retq 

This seems less efficient than the initial code, as suggested by indirect (not very scientific) measurements. Am I right? If so, is this a "missed optimization opportunity" kind of bug that is worth reporting?

For what it's worth, clang -O3, version 8.0.0, always uses ADC (as I wanted) and icc -O3, version 19.0.1, never does.

I've tried using the intrinsic _addcarry_u32 but it didn't help.

unsigned f2(unsigned a, unsigned b) {
    b += a;
    unsigned char c = b < a;
    _addcarry_u32(c, b, X, &b);
    return b;
}

I reckon I might not be using _addcarry_u32 correctly (I couldn't find much info on it). What's the point of using it since it's up to me to provide the carry flag? (Again, introducing c and praying for the compiler to understand the situation.)

I might, actually, be using it correctly. For X == 0 I'm happy:

f2(unsigned int, unsigned int):
 add %esi,%edi
 mov %edi,%eax
 adc $0x0,%eax
 retq 

For X == -1 I'm unhappy :-(

f2(unsigned int, unsigned int):
 add %esi,%edi
 mov $0xffffffff,%eax
 setb %dl
 add $0xff,%dl
 adc %edi,%eax
 retq 

I do get the ADC, but this is clearly not the most efficient code. (What's dl doing there? Two instructions to read the carry flag and restore it? Really? I hope I'm very wrong!)

mov + adc $-1, %eax is more efficient than xor-zero + setc + 3-component lea for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs. (See footnote 1.)


This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the adc pattern recognition from happening.

I don't know what exactly it saw / was looking for, so yes, you should report this as a missed-optimization bug. Or, if you want to dig deeper yourself and know something about GCC's internal representations, you could look at the GIMPLE or RTL output after optimization passes and see what happens. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".


The fact that clang compiles it with adc proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)

That problem can certainly happen if you're not careful. For example, writing a general-case adc function that takes carry-in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry, so you can't just use the sum < a+b idiom after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit add/adc/adc where the middle adc has to take carry-in and produce carry-out.

e.g. 0xff...ff + 1 wraps around to 0, so sum = a+b+carry_in / carry_out = sum < a can't optimize to an adc, because it needs to ignore the carry in the special case where b = -1 and carry_in = 1 (b + carry_in wraps to 0, so sum == a and the comparison is false even though the hardware carry would be set).
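To make that special case concrete, here is a minimal sketch in portable C++. The names AddResult, adc_naive, adc_portable and adc_reference are hypothetical helpers of mine, not anything from a library. The naive version uses a single comparison and gets the carry-out wrong when the carry-in wraps one of the addends; the portable version checks each of the two additions separately; the reference version just widens to 64 bits. Whether any of these compiles to a single adc depends on the compiler, so treat this as an illustration of the semantics only.

```cpp
#include <cstdint>

// Hypothetical result type: 32-bit sum plus carry-out.
struct AddResult {
    std::uint32_t sum;
    bool carry_out;
};

// Naive idiom: a single comparison after the 3-input addition.
// Wrong when b + carry_in itself wraps (b == 0xFFFFFFFF, carry_in == 1).
inline AddResult adc_naive(std::uint32_t a, std::uint32_t b, bool carry_in) {
    std::uint32_t sum = a + b + carry_in;
    return {sum, sum < a};
}

// Portable full adder: either of the two additions may carry,
// so each one is checked on its own.
inline AddResult adc_portable(std::uint32_t a, std::uint32_t b, bool carry_in) {
    std::uint32_t t = a + b;
    bool c1 = t < a;                 // carry out of a + b
    std::uint32_t sum = t + carry_in;
    bool c2 = sum < t;               // carry out of adding carry_in
    return {sum, c1 || c2};          // at most one of the two can be set
}

// Reference: do the addition in 64 bits and read bit 32.
inline AddResult adc_reference(std::uint32_t a, std::uint32_t b, bool carry_in) {
    std::uint64_t wide = std::uint64_t{a} + b + carry_in;
    return {static_cast<std::uint32_t>(wide), (wide >> 32) != 0};
}
```

For a = 5, b = 0xFFFFFFFF, carry_in = 1, all three agree that the sum is 5, but adc_naive reports no carry-out while the other two report one.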

So another guess is that maybe gcc considered doing the + X earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.


What's the point of using it since it's up to me to provide the carry flag?

You're using _addcarry_u32 correctly.

The point of its existence is to let you express an add with carry-in as well as carry-out, which is hard in pure C. GCC and clang don't optimize it well, often not simply keeping the carry result in CF.

If you only want carry-out, you can provide a 0 as the carry-in and it will optimize to add instead of adc, but still give you the carry-out as a C variable.
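A minimal sketch of that carry-out-only use, assuming an x86 target and a compiler that provides _addcarry_u32 via <immintrin.h> (GCC, clang and ICC do). The wrapper name add_with_carry_out is hypothetical:

```cpp
#include <immintrin.h>  // _addcarry_u32 (x86 only)

// Hypothetical wrapper: adds a and b, stores the wrapped sum in `sum`,
// and returns the carry-out. The carry-in is the constant 0, so a
// compiler is free to use a plain add rather than adc here.
inline bool add_with_carry_out(unsigned a, unsigned b, unsigned &sum) {
    return _addcarry_u32(0, a, b, &sum) != 0;
}
```

The returned bool is an ordinary C++ value; whether it stays in CF or gets materialized with setc is up to the compiler.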

e.g. to add two 128-bit integers in 32-bit chunks, you can do this:

#include <immintrin.h>  // _addcarry_u32

// Bad on x86-64 because it doesn't optimize the same as 2x _addcarry_u64,
// even though __restrict guarantees non-overlap.
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
    unsigned char carry;
    carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
    carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
    carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
    carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);
}

(On Godbolt with GCC/clang/ICC)
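For comparison, the "2x _addcarry_u64" version mentioned in the comment could look roughly like this. It is a sketch that assumes an x86-64 target, and the function name adc_128bit_u64 is made up for illustration:

```cpp
#include <immintrin.h>  // _addcarry_u64 (x86-64 only)

// Hypothetical 64-bit-limb variant: dst += src, both 128 bits wide,
// stored as two 64-bit limbs with the least-significant limb first.
void adc_128bit_u64(unsigned long long *__restrict dst,
                    const unsigned long long *__restrict src)
{
    unsigned char carry = _addcarry_u64(0, dst[0], src[0], &dst[0]);
    // The carry out of the top limb is deliberately discarded.
    (void)_addcarry_u64(carry, dst[1], src[1], &dst[1]);
}
```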

That's very inefficient vs. unsigned __int128, where compilers would just use 64-bit add/adc, but it does get clang and ICC to emit a chain of add / adc / adc / adc. GCC makes a mess, using setcc to store CF to an integer for some of the steps, then add dl, -1 to put it back into CF for an adc.
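A sketch of the unsigned __int128 alternative, for reference. It assumes a GCC- or clang-compatible compiler on a 64-bit little-endian target (unsigned __int128 is a GNU extension, and the limb order only matches the 32-bit-chunk layout above on little-endian machines). The function name add_128bit_wide is hypothetical:

```cpp
#include <cstring>  // std::memcpy

// Hypothetical equivalent of the 4x _addcarry_u32 chain: dst += src,
// where both point at four 32-bit limbs, least-significant first.
// memcpy avoids alignment and strict-aliasing problems; compilers
// normally turn it into plain loads and stores.
void add_128bit_wide(unsigned *__restrict dst, const unsigned *__restrict src)
{
    unsigned __int128 d, s;
    std::memcpy(&d, dst, sizeof d);
    std::memcpy(&s, src, sizeof s);
    d += s;  // wraps modulo 2^128, like discarding the final carry
    std::memcpy(dst, &d, sizeof d);
}
```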

GCC unfortunately sucks at extended-precision / biginteger code written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.


Footnote 1: or, for uop count, equal: on Intel Haswell and earlier, adc is 2 uops, except with a zero immediate, which Sandybridge-family's decoders special-case as 1 uop.

But the 3-component LEA with a base + index + disp makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.

On Intel Broadwell and later, adc is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.

So equal total uop count but worse latency means that adc would still be a better choice.

https://agner.org/optimize/
