There is this talk, CppCon 2016: Chandler Carruth "Garbage In, Garbage Out: Arguing about Undefined Behavior...", where Mr. Carruth shows an example from the bzip code. It uses uint32_t i1 as an index. On a 64-bit system the array access block[i1] will then do *(block + i1). The issue is that block is a 64-bit pointer whereas i1 is a 32-bit number. The addition might overflow, and since unsigned integers have defined overflow behavior, the compiler needs to add extra instructions to make sure that this wrap-around is honored even on a 64-bit system.
I would like to show this with a simple example as well, so I tried the ++i code with various signed and unsigned integers. The following is my test code:
#include <cstdint>
void test_int8() { int8_t i = 0; ++i; }
void test_uint8() { uint8_t i = 0; ++i; }
void test_int16() { int16_t i = 0; ++i; }
void test_uint16() { uint16_t i = 0; ++i; }
void test_int32() { int32_t i = 0; ++i; }
void test_uint32() { uint32_t i = 0; ++i; }
void test_int64() { int64_t i = 0; ++i; }
void test_uint64() { uint64_t i = 0; ++i; }
With g++ -c test.cpp and objdump -d test.o I get assembly listings like this:
000000000000004e <_Z10test_int32v>:
4e: 55 push %rbp
4f: 48 89 e5 mov %rsp,%rbp
52: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
59: 83 45 fc 01 addl $0x1,-0x4(%rbp)
5d: 90 nop
5e: 5d pop %rbp
5f: c3 retq
To be honest, my knowledge of x86 assembly is rather limited, so my following conclusions and questions may be very naive.
The first two instructions seem to be the function prologue and the last three the return sequence. Removing these lines, the following kernels are left for the various data types:
int8_t:

 4: c6 45 ff 00            movb   $0x0,-0x1(%rbp)
 8: 0f b6 45 ff            movzbl -0x1(%rbp),%eax
 c: 83 c0 01               add    $0x1,%eax
 f: 88 45 ff               mov    %al,-0x1(%rbp)

uint8_t:

19: c6 45 ff 00            movb   $0x0,-0x1(%rbp)
1d: 80 45 ff 01            addb   $0x1,-0x1(%rbp)

int16_t:

28: 66 c7 45 fe 00 00      movw   $0x0,-0x2(%rbp)
2e: 0f b7 45 fe            movzwl -0x2(%rbp),%eax
32: 83 c0 01               add    $0x1,%eax
35: 66 89 45 fe            mov    %ax,-0x2(%rbp)

uint16_t:

40: 66 c7 45 fe 00 00      movw   $0x0,-0x2(%rbp)
46: 66 83 45 fe 01         addw   $0x1,-0x2(%rbp)

int32_t:

52: c7 45 fc 00 00 00 00   movl   $0x0,-0x4(%rbp)
59: 83 45 fc 01            addl   $0x1,-0x4(%rbp)

uint32_t:

64: c7 45 fc 00 00 00 00   movl   $0x0,-0x4(%rbp)
6b: 83 45 fc 01            addl   $0x1,-0x4(%rbp)

int64_t:

76: 48 c7 45 f8 00 00 00   movq   $0x0,-0x8(%rbp)
7d: 00
7e: 48 83 45 f8 01         addq   $0x1,-0x8(%rbp)

uint64_t:

8a: 48 c7 45 f8 00 00 00   movq   $0x0,-0x8(%rbp)
91: 00
92: 48 83 45 f8 01         addq   $0x1,-0x8(%rbp)
Comparing the signed with the unsigned versions I would have expected from Mr. Carruth's talk that extra masking instructions are generated.
But for int8_t we store a zero byte to the stack slot (movb), then load and zero-extend it (movzbl) into the accumulator %eax. The addition (add) is performed without any size suffix, because the overflow is not defined anyway. The unsigned version directly uses byte-sized instructions.
Either both add and addb/addw/addl/addq take the same number of cycles (latency) because the Intel Sandy Bridge CPU has hardware adders for all sizes, or the 32-bit unit does the masking internally and therefore has a longer latency.
I have looked for a table with latencies and found the one on agner.org. There, for each CPU (Sandy Bridge here), there is only one entry for ADD; I do not see separate entries for the other width variants. The Intel 64 and IA-32 Architectures Optimization Reference Manual also seems to list only a single add instruction.
Does this mean that on x86 the ++i of non-native-length integers is actually faster for unsigned types because fewer instructions are generated?
There are two parts to this question: Chandler's point about optimizations based on overflow being undefined, and the differences you found in the assembly output.
Chandler's point is that if overflow is undefined behavior, then the compiler can assume that it cannot happen. Consider the following code:
typedef int T;
void CopyInts(int *dest, const int *src) {
T x = 0;
for (; src[x]; ++x) {
dest[x] = src[x];
}
}
Here, the compiler can safely change the for loop to the following:
while (*src) {
*dest++ = *src++;
}
That's because the compiler does not have to worry about the case that x overflows. If the compiler did have to worry about x overflowing, then when x wraps from 2^32 - 1 back to 0, the effective source and destination addresses suddenly jump back by 16 GB (2^32 elements of 4 bytes each), so the simple pointer-increment transformation above will not work.
At the assembly level, the above is (with GCC 7.3.0 for x86-64, -O2
):
_Z8CopyIntsPiPKii:
movl (%rsi), %edx
testl %edx, %edx
je .L1
xorl %eax, %eax
.L3:
movl %edx, (%rdi,%rax)
addq $4, %rax
movl (%rsi,%rax), %edx
testl %edx, %edx
jne .L3
.L1:
rep ret
If we change T to be unsigned int, we get this slower code instead:
_Z8CopyIntsPiPKij:
movl (%rsi), %eax
testl %eax, %eax
je .L1
xorl %edx, %edx
xorl %ecx, %ecx
.L3:
movl %eax, (%rdi,%rcx)
leal 1(%rdx), %eax
movq %rax, %rdx
leaq 0(,%rax,4), %rcx
movl (%rsi,%rax,4), %eax
testl %eax, %eax
jne .L3
.L1:
rep ret
Here, the compiler is keeping x as a separate variable so that overflow is handled properly.
Instead of relying on signed overflow being undefined for performance, you can use a size type that is the same size as a pointer. This means that such a variable could only overflow at the same point at which the pointer arithmetic itself would, which is also undefined. Hence, at least for x86-64, size_t would also work as T to get the better performance.
Now for the second part of your question: the add instruction. The suffixes on the add instruction come from the so-called "AT&T" syntax of x86 assembly language. In AT&T syntax, the operands are in the opposite order from the way Intel writes instructions, and instruction sizes are disambiguated by adding a suffix to the mnemonic instead of an annotation like dword ptr as in the Intel case.
Example:
Intel: add dword ptr [eax], 1
AT&T: addl $1, (%eax)
These are the same instruction, just written differently. The l takes the place of dword ptr.
In the case where the suffix is missing from AT&T instructions, it's because it's not required: the size is implicit from the operands.
add $1, %eax
The l suffix is unnecessary because the size is obvious from the operand: eax is a 32-bit register, so the instruction is 32-bit.
In short, it has nothing to do with overflow. Overflow is always defined at the processor level. On some architectures, such as when using the non-u instructions on MIPS, overflow throws an exception, but it is still defined. C and C++ are the only major languages that make overflow unpredictable behavior.
Either both add and addb/addw/addl/addq take the same number of cycles (latency) because the Intel Sandy Bridge CPU has hardware adders for all sizes or the 32-bit unit does the masking internally and therefore has a longer latency.
First of all, it's a 64-bit adder because it supports qword add
with the same performance.
In hardware, masking bits doesn't take a whole extra clock cycle; one clock cycle is many gate-delays long. An enable/disable control signal can zero the results from the high half (for 32-bit operand size), or stop carry-propagation at 16 or 8 bits (for smaller operand-sizes that leave the upper bits unmodified instead of zero-extending).
So each execution port with an integer ALU execution unit probably uses the same adder transistors for all operand sizes, using control signals to modify its behaviour. Maybe even using it for XOR as well (by blocking all the carry signals).
I was going to write more about your misunderstanding of the optimization issue, but Myria already covered it.
See also What Every C Programmer Should Know About Undefined Behavior, an LLVM blog post that explains some of the ways UB allows optimization, including specifically promoting a counter to 64-bit or optimizing it away into pointer increments, instead of implementing signed wrap-around like you'd get if signed integer overflow were strictly defined as wrapping (e.g. if you compile with gcc -fwrapv, the opposite of -fstrict-overflow).
Your un-optimized compiler output is pointless and doesn't tell us anything. The x86 add instruction implements both unsigned and signed 2's-complement addition, because those are the same binary operation. The different code-gen at -O0 is merely an artefact of compiler internals, not anything fundamental that would happen in real code (with -O2 or -O3).