
Why doesn't GCC use partial registers?

Disassembling write(1,"hi",3) on Linux, built with gcc -s -nostdlib -nostartfiles -O3, results in:

ba03000000     mov edx, 3 ; thanks for the correction jester!
bf01000000     mov edi, 1
31c0           xor eax, eax
e9d8ffffff     jmp loc.imp.write

I'm not into compiler development, but since every value moved into these registers is constant and known at compile time, I'm curious why gcc doesn't use dl , dil , and al instead. Some may argue that this feature won't make any difference in performance, but there's a big difference in executable size between mov $1, %eax => b801000000 and mov $1, %al => b001 when we are talking about thousands of register accesses in a program. Not only is small size part of a software's elegance, it also has an effect on performance.

Can someone explain why "GCC decided" that it doesn't matter?

Yes, GCC generally avoids writing to partial registers, unless optimizing for size ( -Os ) instead of purely speed ( -O3 ). Some cases require writing at least the 32-bit register for correctness, so a better example would be something like:

char foo(char *p) { return *p; } compiles to movzx eax, byte ptr [rdi]
instead of mov al, [rdi] . https://godbolt.org/z/4ca9cTG9j
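
Roughly, the two alternatives side by side (a sketch; exact output depends on GCC version and options):

movzx eax, byte ptr [rdi]   ; what GCC emits: writes all of EAX/RAX, no partial-register effects
ret

mov   al, byte ptr [rdi]    ; the avoided form: writes only AL, leaving the rest of EAX stale
ret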

But GCC doesn't always avoid partial registers, and sometimes it even causes partial-register stalls: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533


Writing partial registers entails a performance penalty on many x86 processors because, on those CPUs, a written partial register is renamed into a physical register separate from the one holding its full-width counterpart. (For more about register renaming enabling out-of-order execution, see this Q&A ).

But when an instruction reads the whole register, the CPU has to detect the fact that it doesn't have the correct architectural register value available in a single physical register. (This happens in the issue/rename stage, as the CPU prepares to send the uop into the out-of-order scheduler.)

It's called a partial register stall . Agner Fog's microarchitecture manual explains it pretty well:

6.8 Partial register stalls (PPro/PII/PIII and early Pentium-M)

Partial register stall is a problem that occurs when we write to part of a 32-bit register and later read from the whole register or a bigger part of it.
Example:

; Example 6.10a. Partial register stall
mov al, byte ptr [mem8]
mov ebx, eax ; Partial register stall

This gives a delay of 5 - 6 clocks . The reason is that a temporary register has been assigned to AL to make it independent of AH . The execution unit has to wait until the write to AL has retired before it is possible to combine the value from AL with the value of the rest of EAX .

Behaviour differs between CPU families. On CPUs that never rename partial registers (AMD, for example, as noted below), writing a partial register merges into the full register, making the write depend on the old value of the full register as an input.

Without partial-register renaming, the input dependency created by the write is a false dependency if you never read the full register. This limits instruction-level parallelism, because reusing an 8- or 16-bit register for something else is not actually independent from the CPU's point of view (16-bit code can access 32-bit registers, so it has to maintain correct values in the upper halves). It also makes AL and AH not independent of each other. When Intel designed the P6 family (PPro released in 1995), 16-bit code was still common, so partial-register renaming was an important feature for making existing machine code run faster. (In practice, many binaries don't get recompiled for new CPUs.)

That's why compilers mostly avoid writing partial registers. They use movzx / movsx whenever possible to zero- or sign-extend narrow values to a full register to avoid partial-register false dependencies (AMD) or stalls (Intel P6-family). Thus most modern machine code doesn't benefit much from partial-register renaming, which is why recent Intel CPUs are simplifying their partial-register renaming logic.
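
For instance, a sketch of how the problematic load from Agner Fog's example 6.10a can be rewritten to avoid the stall:

movzx eax, byte ptr [mem8]   ; full-register write: no merge needed, no stall
mov   ebx, eax               ; reads a register that was written in full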

As @BeeOnRope's answer points out , compilers still read partial registers, because that's not a problem. (Reading AH/BH/CH/DH can add an extra cycle of latency on Haswell/Skylake, though, see the earlier link about partial registers on recent members of Sandybridge-family.)


Also note that write takes arguments that, for a typically configured x86-64 GCC, need whole 32-bit and 64-bit registers, so the call couldn't simply be assembled with mov dl, 3 . The size is determined by the type of the argument, not by its value.

Only 32-bit register writes implicitly zero-extend to the full 64-bit register; writing 8- and 16-bit partial registers leaves the upper bytes unchanged. (That merging behaviour is tricky for hardware to handle efficiently, which is why AMD64 didn't follow the same pattern for 32-bit writes.)
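
A quick sketch of the architectural difference (the values shown are what the ISA guarantees, independent of microarchitecture):

mov rax, 0x1122334455667788
mov eax, 3                   ; 32-bit write: RAX = 0x0000000000000003 (upper 32 bits zeroed)

mov rax, 0x1122334455667788
mov ax, 3                    ; 16-bit write: RAX = 0x1122334455660003 (upper 48 bits kept)

mov rax, 0x1122334455667788
mov al, 3                    ; 8-bit write:  RAX = 0x1122334455667703 (upper 56 bits kept)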

Finally, in certain contexts C has default argument promotions to be aware of, though that is not the case here.
Actually, as RossRidge pointed out, the call was probably made without a visible prototype, in which case the default promotions would apply after all.


Your disassembly is misleading, as @Jester pointed out.
For example, mov rdx, 3 is actually mov edx, 3 , although both have the same effect, namely putting 3 into the whole rdx .
This is true because an immediate value of 3 doesn't require sign-extension, and a MOV r32, imm32 implicitly clears the upper 32 bits of the register.
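
Concretely (standard encodings, shown disassembly-style like the listing above):

ba 03 00 00 00          mov edx, 3   ; 5 bytes; writing EDX zeroes the upper half of RDX
48 c7 c2 03 00 00 00    mov rdx, 3   ; 7 bytes; sign-extends the 32-bit immediate into RDX

Both leave RDX = 3, which is why assemblers and compilers prefer the shorter 32-bit form.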

In fact, gcc very often uses partial registers . If you look at generated code, you'll find lots of cases where partial registers are used.

The short answer for your particular case , is because gcc always sign or zero-extends arguments to 32-bits when calling a C ABI function .

The de facto SysV x86 and x86-64 ABI adopted by gcc and clang requires that parameters smaller than 32 bits be zero- or sign-extended to 32 bits. Interestingly, they don't need to be extended all the way to 64 bits.

So for a function like the following on a 64-bit SysV ABI platform:

void foo(short s) {
 ...
}

... the argument s is passed in rdi and the bits of rdi will be as follows (but see my caveat below regarding icc ):

  bits 0-31:  SSSSSSSS SSSSSSSS SPPPPPPP PPPPPPPP
  bits 32-63: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
  where:
  P: the bottom 15 bits of the value of `s`
  S: the sign bit of `s` (extended into bits 16-31)
  X: arbitrary garbage

The code for foo can depend on the S and P bits, but not on the X bits, which may be anything.

Similarly, for foo_unsigned(unsigned short u) , you'd have 0 in bits 16-31, but it would otherwise be identical.

Note that I said de facto, because it actually isn't really documented what to do for smaller return types, but you can see Peter's answer here for details. I also asked a related question here .

After some further testing, I concluded that icc actually breaks this de facto standard. gcc and clang seem to adhere to it, but gcc only in a conservative way: when calling a function, it does zero/sign-extend arguments to 32 bits, but in its own function implementations it doesn't depend on the caller having done so. clang implements functions that do depend on the caller extending the parameters to 32 bits. So in fact clang and icc are mutually incompatible, even for plain C functions, if they have any parameters smaller than int .
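
As a sketch (register choices per the SysV calling convention; the exact instructions vary by compiler and version), a caller passing a short does the extension itself, while a gcc-style conservative callee re-extends anyway:

; caller side (gcc/clang): extend the short to 32 bits before the call
movsx edi, ax            ; assuming the 16-bit value happens to be in AX
call  foo

; callee side, gcc-style conservatism: re-extend rather than trust the caller
foo:
movsx eax, di            ; sign-extend the low 16 bits of the incoming argument register
...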

All three of the earlier answers are wrong in different ways.

The accepted answer by Margaret Bloom implies that partial register stalls are to blame. Partial register stalls are a real thing, but are unlikely to be relevant to GCC's decision here.

If GCC replaced mov edx,3 by mov dl,3 , then the code would just be wrong, because writes to byte registers (unlike writes to dword registers) don't zero the rest of the register. The parameter in rdx is of type size_t , which is 64 bits, so the callee will read the full register, which will contain garbage in bits 8 to 63. Partial register stalls are purely a performance issue; it doesn't matter how fast the code runs if it's wrong.

That bug could be fixed by inserting xor edx,edx before mov dl,3 . With that fix, there is no partial register stall, because zeroing a full register with xor or sub and then writing to the low byte is special-cased in all CPUs that have the stalling problem. So partial register stalls are still irrelevant with the fix.
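
In the context of the original listing, the fixed sequence would look something like this (a sketch):

xor edx, edx          ; zeroing idiom: recognised by the renamer, clears all of RDX
mov dl, 3             ; byte write into a known-zeroed register: no stall on later reads of RDX
mov edi, 1
xor eax, eax
jmp loc.imp.write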

The only situation where partial register stalls would become relevant is if GCC happened to know that the register was zero, but it wasn't zeroed by one of the special-cased instructions. For example, if this syscall was preceded by

loop:
  ...
  dec edx
  jnz loop

then GCC could deduce that rdx was zero at the point where it wants to put 3 in it, and mov dl,3 would be correct – but it would be a bad idea in general because it could cause a partial-register stall. (Here, it wouldn't matter because syscalls are so slow anyway, but I don't think GCC has a "slow function that there's no need to speed-optimize calls to" attribute in its internal type system.)


Why doesn't GCC emit xor followed by a byte move, if not because of partial register stalls? I don't know but I can speculate.

It only saves space when initializing r0 through r3 (rax, rcx, rdx, rbx, the registers whose low byte is encodable without a REX prefix), and even then it only saves one byte. It increases the number of instructions, which has its own costs (the instruction decoders are frequently a bottleneck). It also clobbers the flags, unlike the standard mov , which means it isn't a drop-in replacement. GCC would have to track a separate flag-clobbering register initialization sequence, which in most cases (11/15 of possible destination registers) would be unambiguously less efficient.
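
For example (standard encodings; the xor-plus-byte-move trick only pays off when the low byte is encodable without a REX prefix):

ba 03 00 00 00    mov edx, 3      ; 5 bytes
31 d2             xor edx, edx    ; 2 bytes
b2 03             mov dl, 3       ; 2 bytes -> 4 bytes total, saves 1 byte

bf 01 00 00 00    mov edi, 1      ; 5 bytes
31 ff             xor edi, edi    ; 2 bytes
40 b7 01          mov dil, 1      ; 3 bytes (REX needed for DIL) -> 5 bytes total, saves nothing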

If you're aggressively optimizing for size, you can do push 3 followed by pop rdx , which saves 2 bytes regardless of the destination register, and doesn't clobber the flags. But it is probably much slower because it writes to memory and has a false read-write dependence on rsp , and the space savings seem unlikely to be worth it. (It also modifies the red zone , so it isn't a drop-in replacement either.)
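
Size-wise, that is (a sketch):

6a 03             push 3          ; 2 bytes (pushes a sign-extended 8-bit immediate)
5a                pop rdx         ; 1 byte -> 3 bytes total vs 5 for mov edx, 3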


supercat's answer says

Processor cores often include logic to execute multiple 32-bit or 64-bit instructions simultaneously, but may not include logic to execute an 8-bit operation simultaneously with anything else. Consequently, while using 8-bit operations on the 8088 when possible was a useful optimization on the 8088, it can actually be a significant performance drain on newer processors.

Modern optimizing compilers actually use 8-bit GPRs quite a lot. (They use 16-bit GPRs relatively rarely, but I think that's because 16-bit quantities are uncommon in modern code.) 8-bit and 16-bit operations are at least as fast as 32-bit and 64-bit operations at most execution stages, and some are faster.

I previously wrote here, "As far as I know, 8-bit operations are as fast as, or faster than, 32/64-bit operations on absolutely every 32/64-bit x86/x64 processor ever made." But I was wrong. Quite a few superscalar x86/x64 processors merge 8- and 16-bit destinations into the full register on every write, which means that write-only instructions like mov have a false read dependency when the destination is 8/16 bits that doesn't exist when it's 32/64 bits. False dependency chains can slow execution if you don't clear the register before every move (or during, using something like movzx ). Newer processors have this problem even though the earliest out-of-order processors (Pentium Pro/II/III) didn't have it. In spite of that, modern optimizing compilers do use the smaller registers in my experience.
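
A sketch of the false-dependency effect on CPUs that merge on every partial write (the register choices are arbitrary):

mov   al, [rsi]              ; effectively a read-modify-write of RAX on merging CPUs
mov   al, [rsi+1]            ; must wait for the previous merge: a serial chain through RAX
mov   al, [rsi+2]

movzx eax, byte ptr [rsi]    ; full-register writes carry no input dependency,
movzx eax, byte ptr [rsi+1]  ; so these loads can be renamed and executed independently
movzx eax, byte ptr [rsi+2]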


BeeOnRope's answer says

The short answer for your particular case , is because gcc always sign or zero-extends arguments to 32-bits when calling a C ABI function.

But this function has no parameters shorter than 32 bits in the first place. File descriptors are exactly 32 bits long, and size_t is exactly 64 bits long. It doesn't matter that many of those bits are often zero. They aren't variable-length integers that are encoded in 1 byte if they're small. It would only be correct to use mov dl,3 , with the rest of rdx possibly being nonzero, for a parameter if there was no integer promotion requirement in the ABI and the actual parameter type was char or some other 8-bit type.

On something like the original IBM PC, if AH was known to contain 0 and it was necessary to load AX with a value like 0x34, using "MOV AL,34h" would generally take 8 cycles rather than the 12 required for "MOV AX,0034h"--a pretty big speed improvement (either instruction could execute in 2 cycles if pre-fetched, but in practice the 8088 spends most of its time waiting for instructions to be fetched at a cost of four cycles per byte). On the processors used in today's general-purpose computers, however, the time required to fetch code is generally not a significant factor in overall execution speed, and code size is normally not a particular concern.
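
The arithmetic behind those numbers (instruction fetch on the 8088 costs roughly 4 cycles per byte):

b0 34       mov al, 34h      ; 2 bytes -> about 8 cycles of fetch
b8 34 00    mov ax, 0034h    ; 3 bytes -> about 12 cycles of fetch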

Further, processor vendors try to maximize the performance of the kinds of code people are likely to run, and 8-bit load instructions aren't likely to be used nearly as often nowadays as 32-bit load instructions. Processor cores often include logic to execute multiple 32-bit or 64-bit instructions simultaneously, but may not include logic to execute an 8-bit operation simultaneously with anything else. Consequently, while using 8-bit operations on the 8088 when possible was a useful optimization on the 8088, it can actually be a significant performance drain on newer processors.
