Which processor part does a MOV instruction in armv8 use

Question

Suppose I have the following instruction - MOV X5, XZR
What part of the processor hardware would this MOV pseudo instruction use? What I mean is - does the MOV instruction require the use of the ALU or the Memory? It would obviously require accessing the register.

I am curious because I am going through the textbook "Computer Organization and Design", in which the authors discuss 2-issue processors. The requirement for two instructions to be in the same packet is that if one instruction is a memory instruction, then the other must be an ALU/logic instruction or a branch. The instruction I mentioned above is followed by a branch instruction, and I am not sure whether the two instructions can be in the same packet.

If you could share some information about how this pseudo instruction is actually implemented that would be very helpful as well. Thanks for any help.

Answer 1

XZR is an alias for a register that always reads as 0 and can't be changed to anything but 0; writes to it are discarded. It's new in AArch64, but other RISCs like MIPS have always had a zero register. (32-bit ARM / Thumb, the AArch32 state in ARMv8, is a different architecture that some AArch64 CPUs can also execute.)

Registers don't exist in memory and don't involve memory unless an instruction is moving data from memory to a register or vice versa.

This instruction is basically setting register X5 to zero by copying one register to another.
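The zero-register behaviour described above can be sketched with a toy register file. This is only an illustrative model, not how any real core is built: the class name `RegisterFile`, the helper `mov`, and the choice to treat register number 31 as XZR are assumptions made for the example.

```python
class RegisterFile:
    """Toy model of integer registers with a hard-wired zero register."""

    ZERO_REG = 31            # assumption: number 31 is treated as XZR here
    MASK64 = (1 << 64) - 1   # registers are 64 bits wide

    def __init__(self):
        self._regs = [0] * 31  # storage for X0..X30 only; XZR needs none

    def read(self, n):
        # Reading the zero register always yields 0.
        return 0 if n == self.ZERO_REG else self._regs[n]

    def write(self, n, value):
        # Writes to the zero register are silently discarded.
        if n != self.ZERO_REG:
            self._regs[n] = value & self.MASK64


def mov(rf, dst, src):
    """Hypothetical helper: copy one register to another, no memory involved."""
    rf.write(dst, rf.read(src))


rf = RegisterFile()
rf.write(5, 0xDEADBEEF)            # X5 holds some old value
mov(rf, 5, RegisterFile.ZERO_REG)  # MOV X5, XZR
rf.write(RegisterFile.ZERO_REG, 7) # attempted write to XZR has no effect
```

After the `mov` call, `rf.read(5)` is 0, and the zero register still reads as 0 even after the attempted write.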

ARM was part of the whole "RISC" paradigm, with some practical efficiency compromises. AArch64 makes it even more RISCy, removing some ARM things that complicate modern superscalar pipelines, as well as widening registers to 64-bit. Some design principles of that RISC paradigm are:

A large number of registers are provided. AArch64 has 31 general-purpose integer registers (X0-X30) plus the zero register, up from 15 in 32-bit ARM (not including the program counter). (Even that was large compared to x86's 8 back in the day.)
There are instructions to load and store data between memory and registers (which is why RISC designs are also called "load-store architectures").
Other instructions such as ADD, SUB, etc. work on registers exclusively; register-with-memory operations are very limited. So things like "Add what's at memory location 1000 to register X" are not available: you have to "Load X2 with what's at memory location 1000" then "X = X + X2". (add reg, mem or even add mem, reg are classic CISC features that RISCs avoid.)
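As a rough illustration of the load-store split, here is a minimal sketch in which only `ldr` touches memory and `add` sees registers only. The dictionaries `memory` and `regs`, the helper names, and the sample values are all made up for the example.

```python
MASK64 = (1 << 64) - 1

memory = {1000: 42}          # hypothetical memory contents
regs = {"X1": 8, "X2": 0}    # hypothetical register contents


def ldr(dst, address):
    """Load: the only kind of step here that reads memory."""
    regs[dst] = memory[address]


def add(dst, a, b):
    """ALU operation: operands and result are registers, never memory."""
    regs[dst] = (regs[a] + regs[b]) & MASK64


# "Add what's at memory location 1000 to X1" takes two instructions:
ldr("X2", 1000)          # Load X2 with what's at memory location 1000
add("X1", "X1", "X2")    # X1 = X1 + X2
```

The memory access and the arithmetic are separate steps, which is what lets a simple pipeline classify each instruction as either a memory operation or an ALU operation.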

So given that legacy, you'd probably put this instruction in the "ALU" category, since it doesn't talk to memory at all and it only operates on integer registers (not FP/vector). As far as the rest of the pipeline is concerned, it only reads and writes integer register values; it doesn't access memory and doesn't branch.

But what an ALU does in a CPU is this: it takes inputs, performs an operation, then delivers the result to an output. In a RISC the inputs will always be registers or immediate values encoded in the instruction.

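With MOV, there is no real operation; the input is simply delivered to the output. It could bypass the ALU, or for simplicity of data paths still go through the ALU with control signals that make it do something like OR with 0 so the value comes out unchanged. In fact, AArch64 defines the register form of MOV as an alias of ORR with the zero register, so MOV X5, XZR encodes as ORR X5, XZR, XZR.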
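The "go through the ALU anyway" option can be sketched as follows. This is a toy model under stated assumptions: the function `alu`, its string opcodes, and `mov_via_alu` are hypothetical names, and a real datapath selects operations with control signals rather than strings.

```python
MASK64 = (1 << 64) - 1


def alu(op, a, b):
    """Toy 64-bit ALU: two inputs, one operation, one output."""
    if op == "add":
        return (a + b) & MASK64
    if op == "sub":
        return (a - b) & MASK64
    if op == "and":
        return a & b
    if op == "orr":
        return a | b
    raise ValueError(f"unsupported ALU op: {op}")


def mov_via_alu(src_value):
    """Model a register move as OR with zero, so no dedicated MOV path is needed."""
    zero = 0  # value supplied by the zero register
    return alu("orr", zero, src_value)


x5 = mov_via_alu(0)       # MOV X5, XZR: source value is the zero register's 0
x6 = mov_via_alu(0x1234)  # moving a non-zero value passes it through unchanged
```

Because OR with 0 is an identity operation, the value leaves the ALU unchanged, and the move needs no hardware beyond what ORR already uses.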

As you can see the real world is not as neat as your textbook. I don't know how the pipeline in any given ARM CPU actually works.

Answer 2

The question really is not about any particular ISA, even though the example uses AArch64 instruction mnemonics; it is about CPU micro-architecture, in particular a 2-way superscalar, in-order micro-architecture. For any particular micro-architecture, the answer to whether two instructions can be issued together is "it depends". So depending on which design you look at, you'll get a different answer. Building a CPU involves many trade-offs to hit a desired power, performance, and area target, which is why the answers differ.

Since you are reading "Computer Organization and Design", which is an entry-level CPU micro-architecture textbook, let's simplify the micro-architecture to something idealized instead of concerning yourself with an industry design, which at this point will likely only confuse you more. Assume your micro-architecture has 2 identical 3-stage pipes that can handle all operations in a single cycle with no bypass network. Your pipeline now looks like:

| Fetch0 | -> | Decode0 | -> | Execute+Writeback |
| Fetch1 | -> | Decode1 | -> | Execute+Writeback |

In this simplified case, the answer is that during decode your two decoders must do register dependency analysis on both instructions. If the mov produces a register the branch consumes, they cannot execute together and you have to delay the branch until the mov executes; otherwise they can flow down the pipeline together.
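The decode-time check for this idealized two-pipe machine can be sketched like this. It models only the read-after-write case described above; the `Instr` record, the `can_dual_issue` function, and the hand-written read/write sets are assumptions for illustration, and a real decoder derives them from the instruction encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction: which registers it writes and reads."""
    text: str
    writes: frozenset = frozenset()
    reads: frozenset = frozenset()


def can_dual_issue(first, second):
    """Return True if `second` may issue in the same cycle as `first`.

    Only the read-after-write hazard is checked: `second` must not read
    a register that `first` produces in the same cycle.
    """
    return not (first.writes & second.reads)


# XZR is a constant source, so it is left out of the read set.
mov = Instr("MOV X5, XZR", writes=frozenset({"X5"}))
b = Instr("B target")                                 # unconditional, reads no register
cbz = Instr("CBZ X5, target", reads=frozenset({"X5"}))

pair_with_b = can_dual_issue(mov, b)      # no shared register
pair_with_cbz = can_dual_issue(mov, cbz)  # CBZ consumes X5 produced by MOV
```

In this sketch `pair_with_b` is `True` and `pair_with_cbz` is `False`, so the CBZ would have to wait a cycle while the plain branch could issue alongside the MOV.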

Of course, the decision about what can be paired gets more complicated in a real design with asymmetric execution resources, more pipeline stages, multi-cycle instructions, bypass networks, decoupled fetch/execute, and speculative execution, to name a few micro-architecture tricks of the trade.

If you are interested in finding out whether a commercial design can pair two particular types of instructions, you can look at the design's software optimization guide, if one is available, to understand what resources each instruction uses. For example, see the Arm Cortex-A55 Software Optimization Guide.

solution1
4 ACCEPTED 2019-12-04 00:17:49

solution2
3 2019-12-04 15:41:53