Is there anything special about -1 (0xFFFFFFFF) regarding ADC?
In a research project of mine I'm writing C++ code. However, the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag-manipulating instructions, in particular to `ADC`, but this shouldn't be a problem provided the compiler is smart enough to use it. Consider:
constexpr unsigned X = 0;

unsigned f1(unsigned a, unsigned b) {
    b += a;
    unsigned c = b < a;
    return c + b + X;
}
Variable `c` is a workaround to get my hands on the carry flag and add it to `b` and `X`. It looks like I got lucky, and the code generated by `g++ -O3` (version 9.1) is this:
f1(unsigned int, unsigned int):
add %edi,%esi
mov %esi,%eax
adc $0x0,%eax
retq
For all values of `X` that I've tested, the code is as above (except, of course, for the immediate value `$0x0`, which changes accordingly). I found one exception though: when `X == -1` (or `0xFFFFFFFFu` or `~0u`; it really doesn't matter how you spell it), the generated code is:
f1(unsigned int, unsigned int):
xor %eax,%eax
add %edi,%esi
setb %al
lea -0x1(%rsi,%rax,1),%eax
retq
This seems less efficient than the initial code, as suggested by indirect (and admittedly not very scientific) measurements. Am I right? If so, is this a "missed optimization opportunity" kind of bug that is worth reporting?
For what it's worth, `clang -O3`, version 8.0.0, always uses `ADC` (as I wanted) and `icc -O3`, version 19.0.1, never does.
I've tried using the intrinsic `_addcarry_u32`, but it didn't help.
#include <immintrin.h>  // declares _addcarry_u32

unsigned f2(unsigned a, unsigned b) {
    b += a;
    unsigned char c = b < a;
    _addcarry_u32(c, b, X, &b);
    return b;
}
I reckon I might not be using `_addcarry_u32` correctly (I couldn't find much info on it). What's the point of using it since it's up to me to provide the carry flag? (Again, introducing `c` and praying for the compiler to understand the situation.)
I might, actually, be using it correctly. For `X == 0` I'm happy:
f2(unsigned int, unsigned int):
add %esi,%edi
mov %edi,%eax
adc $0x0,%eax
retq
For `X == -1` I'm unhappy :-(
f2(unsigned int, unsigned int):
add %esi,%edi
mov $0xffffffff,%eax
setb %dl
add $0xff,%dl
adc %edi,%eax
retq
I do get the `ADC`, but this is clearly not the most efficient code. (What's `dl` doing there? Two instructions to read the carry flag and restore it? Really? I hope I'm very wrong!)

`mov` + `adc $-1, %eax` is more efficient than `xor`-zero + `setc` + 3-component `lea` for both latency and uop count on most CPUs, and no worse on any still-relevant CPU (see footnote 1).
This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the `adc` pattern recognition from happening.
I don't know what exactly it saw or was looking for, so yes, you should report this as a missed-optimization bug. Or, if you want to dig deeper yourself and know something about GCC's internal representations, you could look at the GIMPLE or RTL output after optimization passes and see what happens. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".
The fact that clang compiles it with `adc` proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)
That problem can certainly happen if you're not careful. For example, writing a general-case `adc` function that takes carry-in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry, so you can't just use the `sum < a+b` idiom after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit `add/adc/adc` where the middle `adc` has to take carry-in and produce carry-out.
e.g. `0xff...ff + 1` wraps around to 0, so `sum = a+b+carry_in` / `carry_out = sum < a` can't optimize to an `adc`, because it needs to ignore carry in the special case where `a = -1` and `carry_in = 1`.
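The pitfall described above can be sketched in a few lines. This is only an illustration: the names `AdcResult`, `adc_naive` and `adc_checked` are hypothetical, and `carry_in` is assumed to be 0 or 1. The naive version uses a single comparison after the 3-input add; the checked version tests each of the two additions separately.

```cpp
#include <cstdint>

// Result of a 32-bit add-with-carry: the wrapped sum and the carry-out bit.
// (Hypothetical helper type, for illustration only.)
struct AdcResult {
    std::uint32_t sum;
    std::uint32_t carry_out;
};

// Naive version: one comparison after the 3-input add.
// It misses the carry when b + carry_in wraps to 0, because then
// sum == a and the comparison reports no carry.
inline AdcResult adc_naive(std::uint32_t a, std::uint32_t b, std::uint32_t carry_in) {
    std::uint32_t sum = a + b + carry_in;
    return {sum, static_cast<std::uint32_t>(sum < a)};
}

// Checked version: test each of the two additions separately.
// Assuming carry_in is 0 or 1, at most one of them can carry,
// so OR-ing the two flags gives the carry-out.
inline AdcResult adc_checked(std::uint32_t a, std::uint32_t b, std::uint32_t carry_in) {
    std::uint32_t partial = a + b;
    std::uint32_t c1 = partial < a;      // carry from a + b
    std::uint32_t sum = partial + carry_in;
    std::uint32_t c2 = sum < partial;    // carry from adding carry_in
    return {sum, c1 | c2};
}
```

With `a = b = 0xFFFFFFFF` and `carry_in = 1`, the true result has a carry-out of 1; `adc_naive` reports 0 while `adc_checked` reports 1. Whether a given compiler turns `adc_checked` into a single `adc` is a separate question that this sketch does not answer.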
So another guess is that maybe gcc considered doing the `+ X` earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.
> What's the point of using it since it's up to me to provide the carry flag?
You're using `_addcarry_u32` correctly.
The point of its existence is to let you express an add with carry-in as well as carry-out, which is hard in pure C. GCC and clang don't optimize it well, often failing to simply keep the carry result in CF.
If you only want carry-out, you can provide a `0` as the carry-in and it will optimize to `add` instead of `adc`, but still give you the carry-out as a C variable.
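A minimal sketch of the carry-out-only case follows. The wrapper name `add_with_carry_out` and the `HAVE_ADDCARRY` macro are hypothetical; the intrinsic path is guarded so that non-x86 targets fall back to the plain `sum < a` comparison, which is an assumption added here for portability rather than something from the discussion above.

```cpp
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>  // declares _addcarry_u32
#define HAVE_ADDCARRY 1
#endif

// Adds a and b, stores the wrapped sum, and returns the carry-out.
// (Hypothetical wrapper, for illustration only.)
inline bool add_with_carry_out(std::uint32_t a, std::uint32_t b, std::uint32_t& sum) {
#ifdef HAVE_ADDCARRY
    unsigned int out = 0;
    // Carry-in is the constant 0, so only the carry-out is of interest.
    unsigned char carry = _addcarry_u32(0, a, b, &out);
    sum = out;
    return carry != 0;
#else
    // Portable fallback: an unsigned add wrapped iff the sum is below an input.
    sum = a + b;
    return sum < a;
#endif
}
```

For example, adding `0xFFFFFFFF` and `1` stores 0 in `sum` and returns `true`.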
e.g. to add two 128-bit integers in 32-bit chunks, you can do this:
#include <immintrin.h>  // declares _addcarry_u32

// bad on x86-64 because it doesn't optimize the same as 2x _addcarry_u64
// even though __restrict guarantees non-overlap.
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
    unsigned char carry;
    carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
    carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
    carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
    carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);
}
(On Godbolt with GCC/clang/ICC)
That's very inefficient vs. `unsigned __int128`, where compilers would just use 64-bit add/adc, but it does get clang and ICC to emit a chain of `add` / `adc` / `adc` / `adc`. GCC makes a mess, using `setcc` to store CF to an integer for some of the steps, then `add dl, -1` to put it back into CF for an `adc`.
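Why `add dl, -1` (written `add $0xff,%dl` in AT&T syntax) restores the flag can be modelled in plain C++. This is only a model of the 8-bit arithmetic, with a hypothetical helper name: `setcc` leaves 0 or 1 in `dl`, and adding 0xff overflows 8 bits exactly when `dl` was 1.

```cpp
#include <cstdint>

// Models the carry flag produced by the 8-bit instruction `add $0xff, %dl`
// for a given starting value of dl. (Hypothetical helper, for illustration.)
inline bool carry_after_add_0xff(std::uint8_t dl) {
    unsigned wide = static_cast<unsigned>(dl) + 0xFFu;  // up to 9 bits wide
    return (wide >> 8) != 0;                             // bit 8 plays the role of CF
}
```

So `carry_after_add_0xff(0)` is `false` and `carry_after_add_0xff(1)` is `true`: CF ends up equal to the carry that was saved, at the cost of two extra instructions per round trip.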
GCC unfortunately sucks at extended-precision / biginteger code written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.
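For reference, here is a sketch of what such pure C++ limb code can look like. It is not taken from gmplib; the names `Limbs128` and `add_128bit_portable` are hypothetical, and it tracks the carry by widening to 64 bits instead of using flags. It can serve as a correctness check for an intrinsic or asm version, but nothing here guarantees that a compiler will turn it into an `add` / `adc` chain.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// A 128-bit integer as four 32-bit limbs, least-significant limb first.
using Limbs128 = std::array<std::uint32_t, 4>;

// Adds src into dst limb by limb and returns the final carry-out (0 or 1).
inline std::uint32_t add_128bit_portable(Limbs128& dst, const Limbs128& src) {
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < dst.size(); ++i) {
        // The 64-bit sum cannot overflow: max is 2 * (2^32 - 1) + 1.
        std::uint64_t wide = std::uint64_t{dst[i]} + src[i] + carry;
        dst[i] = static_cast<std::uint32_t>(wide);  // low 32 bits
        carry = wide >> 32;                          // carry into the next limb
    }
    return static_cast<std::uint32_t>(carry);
}
```

Adding 1 to a value whose two low limbs are `0xFFFFFFFF` ripples the carry through both of them and increments the third limb.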
Footnote 1: or for uop count: equal on Intel Haswell and earlier, where `adc` is 2 uops, except with a zero immediate, which Sandybridge-family's decoders special-case as 1 uop. But the 3-component LEA with a `base + index + disp` makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.
On Intel Broadwell and later, `adc` is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.
So equal total uop count but worse latency means that `adc` would still be a better choice.