
Performance difference of signed and unsigned integers of non-native length

There is this talk, CppCon 2016: Chandler Carruth, “Garbage In, Garbage Out: Arguing about Undefined Behavior...”, where Mr. Carruth shows an example from the bzip code. It uses uint32_t i1 as an index. On a 64-bit system the array access block[i1] will then do *(block + i1). The issue is that block is a 64-bit pointer whereas i1 is a 32-bit number. The addition might overflow, and since unsigned integers have defined (wrap-around) overflow behavior, the compiler has to add extra instructions to make sure this behavior is preserved even on a 64-bit system.
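
To make the pattern concrete, here is a minimal sketch of that kind of indexed access (the names block and i1 follow the talk; the surrounding function is invented for illustration):

#include <cstdint>

// A 64-bit pointer indexed by a 32-bit unsigned counter.  Because uint32_t
// arithmetic must wrap modulo 2^32, the compiler cannot simply fold the
// index into the pointer on a 64-bit target.
int32_t sum_block(const int32_t *block, uint32_t n) {
    int32_t sum = 0;
    for (uint32_t i1 = 0; i1 != n; ++i1) {
        sum += block[i1];   // *(block + i1)
    }
    return sum;
}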

I would like to also show this with a simple example. So I have tried the ++i code with various signed and unsigned integers. The following is my test code:

#include <cstdint>

void test_int8() { int8_t i = 0; ++i; }
void test_uint8() { uint8_t i = 0; ++i; }

void test_int16() { int16_t i = 0; ++i; }
void test_uint16() { uint16_t i = 0; ++i; }

void test_int32() { int32_t i = 0; ++i; }
void test_uint32() { uint32_t i = 0; ++i; }

void test_int64() { int64_t i = 0; ++i; }
void test_uint64() { uint64_t i = 0; ++i; } 

With g++ -c test.cpp and objdump -d test.o I get assembly listings like this:

000000000000004e <_Z10test_int32v>:
  4e:   55                      push   %rbp
  4f:   48 89 e5                mov    %rsp,%rbp
  52:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
  59:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
  5d:   90                      nop
  5e:   5d                      pop    %rbp
  5f:   c3                      retq   

To be honest: My knowledge of x86 assembly is rather limited, so my following conclusions and questions may be very naive.

The first two instructions seem to be just the function prologue and the last three the return sequence. Removing these lines, the following kernels are left for the various data types:

  • int8_t :

       4:   c6 45 ff 00             movb   $0x0,-0x1(%rbp)
       8:   0f b6 45 ff             movzbl -0x1(%rbp),%eax
       c:   83 c0 01                add    $0x1,%eax
       f:   88 45 ff                mov    %al,-0x1(%rbp)
  • uint8_t :

      19:   c6 45 ff 00             movb   $0x0,-0x1(%rbp)
      1d:   80 45 ff 01             addb   $0x1,-0x1(%rbp)
  • int16_t :

      28:   66 c7 45 fe 00 00       movw   $0x0,-0x2(%rbp)
      2e:   0f b7 45 fe             movzwl -0x2(%rbp),%eax
      32:   83 c0 01                add    $0x1,%eax
      35:   66 89 45 fe             mov    %ax,-0x2(%rbp)
  • uint16_t :

      40:   66 c7 45 fe 00 00       movw   $0x0,-0x2(%rbp)
      46:   66 83 45 fe 01          addw   $0x1,-0x2(%rbp)
  • int32_t :

      52:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
      59:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
  • uint32_t :

      64:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
      6b:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
  • int64_t :

      76:   48 c7 45 f8 00 00 00    movq   $0x0,-0x8(%rbp)
      7d:   00
      7e:   48 83 45 f8 01          addq   $0x1,-0x8(%rbp)
  • uint64_t :

      8a:   48 c7 45 f8 00 00 00    movq   $0x0,-0x8(%rbp)
      91:   00
      92:   48 83 45 f8 01          addq   $0x1,-0x8(%rbp)

Comparing the signed with the unsigned versions, I would have expected, based on Mr. Carruth's talk, that extra masking instructions would be generated for the unsigned variants.

But for int8_t we first store a zero byte (movb) to the stack slot at -0x1(%rbp), then load it and zero-extend it to 32 bits (movzbl) into the accumulator %eax. The addition (add) is performed without any size suffix, presumably because the overflow is not defined anyway, and the low byte is then stored back (mov %al,-0x1(%rbp)). The unsigned version directly uses byte-sized instructions.

Either both add and addb/addw/addl/addq take the same number of cycles (latency) because the Intel Sandy Bridge CPU has hardware adders for all sizes, or the 32-bit unit does the masking internally and therefore has a longer latency.

I have looked for a table with latencies and found the one at agner.org. There, for each CPU (Sandy Bridge here), there is only one entry for ADD; I do not see separate entries for the different width variants. The Intel 64 and IA-32 Architectures Optimization Reference Manual also seems to list only a single add instruction.

Does this mean that on x86 the ++i of non-native-length integers is actually faster for unsigned types because fewer instructions are generated?

There are two parts to this question: Chandler's point about optimizations based on overflow being undefined, and the differences you found in the assembly output.

Chandler's point is that if overflow is undefined behavior, then the compiler can assume that it cannot happen. Consider the following code:

typedef int T;
void CopyInts(int *dest, const int *src) {
    T x = 0;
    for (; src[x]; ++x) {
        dest[x] = src[x];
    }
}

Here, the compiler can safely change the for loop to the following:

    while (*src) {
        *dest++ = *src++;
    }

That's because the compiler does not have to worry about the case that x overflows. If the compiler has to worry about x overflowing, the source and destination pointers suddenly get 16 GB subtracted from them, so the simple transformation above will not work.
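
To see where that 16 GB figure comes from, here is a standalone sketch (the base address is made up) that just models the address computation src + x when a 32-bit unsigned index wraps:

#include <cstdint>
#include <cstdio>

int main() {
    // Model the address computation src + x for a 32-bit index x that wraps.
    std::uint64_t src = 0x7f0000000000;      // hypothetical 64-bit base address
    std::uint32_t x = 0xFFFFFFFFu;           // index just before wrapping
    std::uint64_t before = src + std::uint64_t(x) * sizeof(int);
    ++x;                                     // wraps to 0 (defined for unsigned)
    std::uint64_t after = src + std::uint64_t(x) * sizeof(int);
    std::printf("effective address moved back by %llu bytes\n",
                static_cast<unsigned long long>(before - after));   // ~16 GB
}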

At the assembly level, the above is (with GCC 7.3.0 for x86-64, -O2 ):

_Z8CopyIntsPiPKii:
  movl (%rsi), %edx
  testl %edx, %edx
  je .L1
  xorl %eax, %eax
.L3:
  movl %edx, (%rdi,%rax)
  addq $4, %rax
  movl (%rsi,%rax), %edx
  testl %edx, %edx
  jne .L3
.L1:
  rep ret

If we change T to be unsigned int , we get this slower code instead:

_Z8CopyIntsPiPKij:
  movl (%rsi), %eax
  testl %eax, %eax
  je .L1
  xorl %edx, %edx
  xorl %ecx, %ecx
.L3:
  movl %eax, (%rdi,%rcx)
  leal 1(%rdx), %eax
  movq %rax, %rdx
  leaq 0(,%rax,4), %rcx
  movl (%rsi,%rax,4), %eax
  testl %eax, %eax
  jne .L3
.L1:
  rep ret

Here, the compiler is keeping x as a separate variable so that overflow is handled properly.
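
Reading that loop body: %edx holds x, and the byte offset in %rcx is rebuilt from the wrapped 32-bit value on every iteration. In source form, the compiler is roughly forced to preserve something like the following (my paraphrase of the assembly, not compiler output):

// What the unsigned-index version must preserve: the 32-bit counter stays
// live and every offset is recomputed from its (possibly wrapped) value.
void CopyIntsUnsigned(int *dest, const int *src) {
    unsigned int x = 0;
    int v = src[0];
    while (v) {
        dest[x] = v;
        x = x + 1u;      // may wrap to 0; the wrap must be preserved exactly
        v = src[x];      // offset rebuilt from the wrapped 32-bit x
    }
}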

Instead of relying on signed overflow being undefined for performance, you can use a size type that is the same size as a pointer. This means that such a variable could only overflow at the same time as a pointer, which is also undefined. Hence, at least for x86-64, size_t would also work as T to get the better performance.
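
For reference, the size_t variant would look like this (the same function as above, with only the typedef changed):

#include <cstddef>

typedef std::size_t T;   // pointer-sized index

void CopyInts(int *dest, const int *src) {
    T x = 0;
    for (; src[x]; ++x) {
        dest[x] = src[x];
    }
}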

Now for the second part of your question: the add instruction. The suffixes on the add instruction are from the so-called "AT&T" style of x86 assembly language. In AT&T assembly language, the parameters are backward from the way Intel writes instructions, and disambiguating instruction sizes is done by adding a suffix to the mnemonic instead of something like dword ptr in the Intel case.

Example:

Intel: add dword ptr [eax], 1

AT&T: addl $1, (%eax)

These are the same instruction, just written differently. The l takes the place of dword ptr .

In the case where the suffix is missing from AT&T instructions, it's because it's not required: the size is implicit from the operands.

add $1, %eax

The l suffix is unnecessary here because the operand size is implied: %eax is a 32-bit register, so the instruction is obviously 32-bit.
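
As an aside, if Intel syntax is easier to read, objdump -d -M intel test.o prints the same disassembly in Intel form, and GCC accepts -masm=intel when emitting assembly with -S.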

In short, it has nothing to do with overflow. Overflow is always defined at the processor level. On some architectures, such as MIPS with its non-u instructions, overflow raises an exception, but it is still defined. C and C++ are unusual among major languages in making signed overflow undefined behavior.

From the question: "Either both add and addb/addw/addl/addq take the same number of cycles (latency) because the Intel Sandy Bridge CPU has hardware adders for all sizes, or the 32-bit unit does the masking internally and therefore has a longer latency."

First of all, it's a 64-bit adder because it supports qword add with the same performance.

In hardware, masking bits doesn't take a whole extra clock cycle; one clock cycle is many gate-delays long. An enable/disable control signal can zero the results from the high half (for 32-bit operand size), or stop carry-propagation at 16 or 8 bits (for smaller operand-sizes that leave the upper bits unmodified instead of zero-extending).

So each execution port with an integer ALU execution unit probably uses the same adder transistors for all operand sizes, using control signals to modify its behaviour. Maybe even using it for XOR as well (by blocking all the carry signals).


I was going to write more about your misunderstanding of the optimization issue, but Myria already covered it.

See also What Every C Programmer Should Know About Undefined Behavior, an LLVM blog post that explains some of the ways UB allows optimization, including specifically promoting a counter to 64-bit or optimizing it away into pointer increments, instead of implementing signed wrap-around like you'd get if signed integer overflow were strictly defined as wrapping (e.g. if you compile with gcc -fwrapv, the opposite of -fstrict-overflow).
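
For example (assuming GCC's usual behaviour with this flag), compiling the CopyInts function above with g++ -O2 -fwrapv should block the pointer-increment transformation even with T = int, giving code closer to the unsigned int version, because the compiler is then required to implement 32-bit wrap-around for the signed counter as well.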


Your un-optimized compiler output is pointless and doesn't tell us anything. The x86 add instruction implements both unsigned and signed two's-complement addition, because they are the same binary operation. The different code-gen at -O0 is merely an artefact of compiler internals, not anything fundamental that would happen in real code (compiled with -O2 or -O3).
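
For example, re-running the experiment with g++ -O2 -c test.cpp and objdump -d test.o should show every test_* function reduced to essentially a bare retq, regardless of signedness: the local variable is dead, so the store and the increment are removed entirely.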
