
Mixed destination/source operand order in RISC-V assembly syntax

Most instructions in RISC-V assembly order the destination operand before the source operands, e.g.:

li  t0, 22        # destination, source
li  t1, 1         # destination, source
add t2, t0, t1    # destination, source, source

But the store instructions have that order reversed:

sb    t0, (sp)    # source, destination
lw    t1, (a0)    # destination, source
vlb.v v4, (a1)    # destination, source
vsb.v v5, (a2)    # source, destination

How come?

What is the motivation for this (arguably) asymmetric assembler syntax design?

I don't see a real inconsistency in RISC-V assembly when it comes to destination and source operands: the destination operand, when it is part of the instruction encoding, always corresponds to the first operand in the assembly language.

If we look at the following instruction examples from four of the six different instruction formats:

  • R-type : add t0, t1, t2
  • I-type : addi t0, t1, 1 [1]
  • J-type : jal ra, off
  • U-type : lui t0, 0x12345

In the assembly instructions above, the destination operand is the first operand. Clearly, this destination operand corresponds to the destination register (rd) in the instruction encoding.
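As a rough illustration of the point above, the following Python sketch encodes three of the listed instructions and reads back bits 11:7, where the base ISA places rd. The helper names (`encode_r`, `encode_i`, `encode_u`, `rd_field`) are hypothetical and introduced only for this example; the jal case is left out because `off` is not specified in the text.

```python
# ABI register names used in the text, mapped to their x-register numbers.
REG = {"ra": 1, "sp": 2, "t0": 5, "t1": 6, "t2": 7}

def encode_r(funct7, rs2, rs1, funct3, rd, opcode):
    """R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

def encode_i(imm, rs1, funct3, rd, opcode):
    """I-type: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

def encode_u(imm20, rd, opcode):
    """U-type: imm[31:12] | rd | opcode."""
    return ((imm20 & 0xFFFFF) << 12) | (rd << 7) | opcode

def rd_field(word):
    """Return bits 11:7, which hold rd in the R, I, U and J formats."""
    return (word >> 7) & 0x1F

add_word = encode_r(0, REG["t2"], REG["t1"], 0, REG["t0"], 0x33)  # add  t0, t1, t2
addi_word = encode_i(1, REG["t1"], 0, REG["t0"], 0x13)            # addi t0, t1, 1
lui_word = encode_u(0x12345, REG["t0"], 0x37)                     # lui  t0, 0x12345

for name, word in [("add", add_word), ("addi", addi_word), ("lui", lui_word)]:
    print(f"{name:5s} 0x{word:08x} rd=x{rd_field(word)}")
```

In each case the first assembly operand (t0, i.e. x5) is what ends up in the rd field.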

Now, let's focus on the store instructions (S-type format). As an example, consider the following store instruction:

sw t0, 8(sp)

I think it is crystal clear that t0 above is a source operand since the store instruction stores its contents in memory.

We can be tempted to think that 8(sp) is a destination operand. However, by closely looking at the S-type instruction format:

[Figure: S-type format: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode]

We can tell that the 8(sp) part in the assembly instruction above isn't really a single operand but actually two, i.e., the immediate 8 (imm) and the source register sp (rs1). If the instruction could instead be written like this (similar to addi [2]):

sw t0, sp, 8

It would become evident that this instruction takes three operands, not just two.
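To make the three-operand view concrete, here is a minimal Python sketch that packs sw t0, 8(sp) into the S-type layout and unpacks it again. The function names `encode_s` and `decode_s` are hypothetical, chosen for this example; the field positions follow the S-type format of the base ISA.

```python
def encode_s(imm, rs2, rs1, funct3=0b010, opcode=0x23):
    """Pack an S-type word; defaults select sw (funct3=010, opcode=0100011)."""
    imm &= 0xFFF                      # 12-bit two's-complement immediate
    imm_hi = (imm >> 5) & 0x7F        # imm[11:5] goes to bits 31:25
    imm_lo = imm & 0x1F               # imm[4:0] goes to bits 11:7 (where rd sits in other formats)
    return (imm_hi << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (imm_lo << 7) | opcode

def decode_s(word):
    """Unpack the three operands of an S-type word: rs2, rs1 and the immediate."""
    imm = (((word >> 25) & 0x7F) << 5) | ((word >> 7) & 0x1F)
    if imm & 0x800:                   # sign-extend from 12 bits
        imm -= 0x1000
    return {"rs2": (word >> 20) & 0x1F, "rs1": (word >> 15) & 0x1F, "imm": imm}

T0, SP = 5, 2                         # x-register numbers for t0 and sp
word = encode_s(8, rs2=T0, rs1=SP)    # sw t0, 8(sp)
fields = decode_s(word)
print(f"0x{word:08x}", fields)
```

The decoded word carries two source registers and an immediate; no field names a destination register.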

The register sp is only read, never modified, so it cannot be considered a destination register. It is a source register, just as t0 is, the register whose contents the store instruction stores in memory. Memory is the destination operand, since it is what receives the content of t0.

The S-type instruction format doesn't encode a destination operand. What the instruction does encode is addressing information about the destination operand. For sw t0, 8(sp), the destination operand is the word in memory at the effective address that the store instruction calculates from sp and 8. The register sp contains part of the addressing information for that word in memory (i.e., the destination operand).
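The behaviour described here can be sketched as a tiny Python model of a store. It is an illustration only, assuming a 32-bit register file, little-endian byte order, and a small `bytearray` standing in for memory; `store_word` is a hypothetical helper, not part of any real simulator.

```python
def store_word(regs, mem, rs2, rs1, imm):
    """Model of sw rs2, imm(rs1): both registers are read, only memory is written."""
    addr = (regs[rs1] + imm) & 0xFFFFFFFF          # effective address from rs1 and imm
    value = regs[rs2] & 0xFFFFFFFF                 # data comes from rs2
    mem[addr:addr + 4] = value.to_bytes(4, "little")
    return addr

regs = [0] * 32
regs[2] = 16                 # sp (x2), an arbitrary example value
regs[5] = 0xDEADBEEF         # t0 (x5), an arbitrary example value
mem = bytearray(64)

before = list(regs)
addr = store_word(regs, mem, rs2=5, rs1=2, imm=8)  # sw t0, 8(sp)
print(addr, mem[addr:addr + 4].hex(), regs == before)
```

After the call, the register file is unchanged and the only state that differs is the word in memory at sp + 8.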

Summary

Assembly instructions in RISC-V that encode a destination operand have this operand as the first one. A store instruction, however, doesn't encode a destination operand. Its destination operand is a location in memory, and the address of this location in memory is computed from the contents of the instruction source operands.


[1] We could possibly argue that the jal ra, off instruction above has an additional destination operand, namely pc, because pc is updated in the following way: pc ← pc + SignExtension(off). However, executing any other instruction also modifies pc, e.g., by incrementing it by four (this may differ for branches and jalr). Anyway, pc is not encoded in any instruction, and it is not directly accessible to the programmer as a register. Therefore, it is not of interest to the discussion. For the same reason, I've also omitted the B-type format from this discussion.

[2] Or just the other way around: imagine you could write addi t0, t0, -1 as addi t0, -1(t0). Would you then say that addi takes two operands (e.g., t0 and -1(t0))?

Assembly language is defined by the assembler, the program. It is up to the author(s) to pick the syntax. An assembler could choose to have the syntax

bob pickle,(jar)

and that would be perfectly valid syntax for storing one register to the address held in another. You could probably even do this with the equivalent of a #define in some assembly language syntaxes.

The "why" question really means you want to talk to the actual developer, who is likely not reading Stack Overflow, although you might get lucky. So this question does not have an actual answer.

To have a chance at success, it is in the best interest of a processor's developers to create, or hire someone to create, an assembler initially and later a toolchain for their new processor. That includes someone sitting down, examining the machine code, and creating a language from it. A third-party assembler for the same target has a better chance at success if its instruction syntax resembles the original, but why bother making a new one if you are not going to mix it up? The instruction syntax is only one part of the whole language defined by the assembler. You will find wide variations for MIPS, ARM, etc., and over time for RISC-V too, although the desire to make new tools has gone down dramatically over the last couple of decades.

The only rules a successful assembler has to follow are those defined by the logic; the syntax can be whatever the authors choose, for whatever reason they choose. So you have to ask each author or team if you want to know, and I am not sure that even Bugzilla would get you there.

A related "why" question: since we spent so much of our early life with the destination on the left

y = mx + b

and not

mx + b = y

What sane person would design an assembly language where the instruction has the destination on the right? Even the high-level languages don't do that.

A possible answer to your question is that someone way back was lazy and used the same code for load and store, or cut and pasted it, and at least the RISC folks that came after followed that convention.

Not just for Intel but for all the major/minor instruction sets you find syntax incompatibilities across tools, x86, arm, mips, msp430, avr, 8051, 6502, z80, etc, and eventually risc-v if not already. The folks that add targets to gnu assembler must take pride in making incompatible assembly languages as they do it so often.

The location within the instruction is generally irrelevant to the assembly language. The authors start off either being in the destination first camp or destination last camp.

add r0,r1,r2  ; r0 = r1 + r2 

add r0,r1,r2  ; r0 + r1 -> r2

And then the naming of registers is free-form and sometimes varies: ax, %ax, r0, $0.

A recent (horrible) fad, which I assume comes from MIPS and its use in schools, is v0, a0, t0, etc., and it is infecting other unrelated instruction sets. Habits from different instruction sets are getting mixed together a lot these days.

They choose how to indicate indirection @r1, (r1), [r1]...

How to indicate pre/post increment/modification and so on as they work through the instructions.

Some choose 4(r1) where another would use [r1,#4].
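As a small illustration that these are only surface spellings, the Python sketch below parses both offset notations into the same (base register, offset) pair. The function `parse_mem_operand` and its two regular expressions are hypothetical, written for this example and not taken from any real assembler.

```python
import re

# "4(r1)" or "(r1)": optional signed offset, then the base register in parentheses.
PAREN = re.compile(r"^\s*(-?\d+)?\s*\(\s*(\w+)\s*\)\s*$")
# "[r1,#4]" or "[r1]": base register in brackets, optional "#offset" after a comma.
BRACKET = re.compile(r"^\s*\[\s*(\w+)\s*(?:,\s*#(-?\d+))?\s*\]\s*$")

def parse_mem_operand(text):
    """Return (base_register, offset) for either spelling of a memory operand."""
    m = PAREN.match(text)
    if m:
        return m.group(2), int(m.group(1) or 0)
    m = BRACKET.match(text)
    if m:
        return m.group(1), int(m.group(2) or 0)
    raise ValueError(f"unrecognized memory operand: {text!r}")

for spelling in ["4(r1)", "[r1,#4]", "(r1)", "[r1]"]:
    print(spelling, "->", parse_mem_operand(spelling))
```

Once the operand is reduced to a base register and an offset, the rest of an assembler does not need to know which notation the author preferred.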

The first assembly language someone learned, or the one they used most heavily, plays a role in how they like to handle others. Some folks just have to make their own tool to avoid learning another language or dealing with what they don't like about it; thus the AT&T thing, and possibly the gnu assembler choices. The same definitely goes for the way MIPS handled a calling convention and how that notion (feature?) infected other tools and possibly classrooms.

Look at the evolution of x86 assembly languages in particular (the AT&T vs Intel being irrelevant to what I am talking about) over time.

As it should be, you simply learn the language that assembler uses and move on, or you write your own assembler to match the language you prefer, if you publish it and others like it then it can work its way into the norm and you are seeing that happen.

Short answer: because other assembly languages do it. You can see a clear connection between RISC-V and MIPS in their design, so no doubt the authors of the documentation followed the MIPS style they had been used to leading up to RISC-V. Exceptions to the rule happen, although always having the destination on the left would be the more purist solution. What is more important is consistency, as you pointed out: don't have one flavor of store one way and another flavor another way. Look at MRS/MSR in a typical ARM syntax: destination/source is in the middle, in the same place.

As far as gnu assembler goes, binutils is open source, so you are perfectly free to switch it around; likewise you are free to create your own assembler with whatever ordering and syntax you wish. If you want it to be part of a toolchain, then as with the current toolchains you need to create or change the compiler to match the assembler and linker.

If this is strictly a "why" question, then it is primarily opinion-based and should be closed. The author of the documentation and author of the assembler (backend) were free to choose and this was the choice.
