
Linux x86 CPU Instruction Layout Confusion

In x86, I understand that multi-byte objects are stored in memory in little-endian style.
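For example, a 32-bit value can be seen stored low byte first with a quick C check (a minimal sketch, assuming a little-endian x86 host):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    uint32_t value = 0x11223344;
    uint8_t bytes[4];
    memcpy(bytes, &value, sizeof value);  /* copy the object representation */

    /* On a little-endian x86 machine this prints: 44 33 22 11 */
    for (int i = 0; i < 4; i++)
        printf("%02x ", bytes[i]);
    printf("\n");
    return 0;
}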

Now, generally speaking, when it comes to CPU instructions, the opcode determines the purpose of the instruction, and data/memory addresses may follow the opcode in its encoded format. My understanding is that the opcode portion of the instruction should be the most significant byte and should thus appear at the highest address of any given instruction encoding.

Can someone explain the memory layout in this x86 Linux GDB example? I would imagine that the opcode 0xb8 would appear at a higher address, since it is the most significant byte.

(gdb) disassemble _start

Dump of assembler code for function _start:
0x08048080 <+0>:    mov    eax,0x11223344

(gdb) x/1xb _start+0
0x8048080 <_start>:     0xb8
(gdb) x/1xb _start+1
0x8048081 <_start+1>:   0x44
(gdb) x/1xb _start+2
0x8048082 <_start+2>:   0x33
(gdb) x/1xb _start+3
0x8048083 <_start+3>:   0x22
(gdb) x/1xb _start+4
0x8048084 <_start+4>:   0x11

It appears the instruction mov eax,0x11223344 is encoded as 0x11 0x22 0x33 0x44 0xb8, if the whole instruction is read as a single little-endian object.

Questions.

1.) How does the CPU know how many bytes the instruction will take up if the first byte it sees is not an opcode?

2.) I'm wondering if perhaps x86 CPU instructions do not even have endianness and are considered some type of string? (I'm probably way off here.)

x86 is a variable-length instruction set. You start with a single byte, which has no endianness; it is wherever it is.

Then, depending on the opcode, there may be more bytes, and those might for example be a 32-bit immediate; if that group of bytes is an immediate or an address of some sort, THOSE bytes will be little-endian. Say you have the five bytes A B C D E (no endianness; think of an array or string). The A byte is the opcode, the B byte is then the lower 8 bits of the immediate, and the E byte the upper 8 bits of the immediate.
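To make that concrete, here is a small C sketch (not part of the original answer) that reassembles the immediate from the five instruction bytes shown in the question:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* The five instruction bytes from the question: opcode, then imm32. */
    uint8_t insn[5] = {0xb8, 0x44, 0x33, 0x22, 0x11};

    uint8_t opcode = insn[0];                /* 'A': the opcode byte, no endianness */
    uint32_t imm = (uint32_t)insn[1]         /* 'B': bits 7..0   */
                 | (uint32_t)insn[2] << 8    /* 'C': bits 15..8  */
                 | (uint32_t)insn[3] << 16   /* 'D': bits 23..16 */
                 | (uint32_t)insn[4] << 24;  /* 'E': bits 31..24 */

    printf("opcode 0x%02x, imm32 0x%08x\n", opcode, imm); /* 0xb8, 0x11223344 */
    return 0;
}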

Opcode is a hard term to use precisely. In these older 8/16-bit CISC processors like x86, the entire first byte was an opcode that you basically looked up in a table to see what it meant (and inside the processor a table really was used to figure out how to execute it). When you look at MIPS or ARM or other (certainly RISC) instruction sets, only a portion of the 32 bits is the "opcode", and in neither of those cases is it the same set of bits from one instruction to another; you have to look at various places in the instruction (yes, there is overlap to keep the decoding sane). MIPS is a lot more consistent: there is one blob of bits in one place to look at, but one of those patterns requires you to look at another blob of bits to fully decode. With ARM you start at a common bit and, as you work your way across, you further decode the instruction; then you may have to grab some random-looking spots to fully decode it. The rest of the bits are operands: which register to use, an immediate, or whatever — the kind of thing that in a CISC you needed a lookup table for (they are implied by the opcode but not defined by bits within the opcode).
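For contrast, here is a minimal C sketch (mine, not from the answer above) of what a fixed-position opcode field looks like in a MIPS-style 32-bit encoding; the instruction word is one real addiu encoding, and the field positions are the standard MIPS ones:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* A 32-bit MIPS instruction word: addiu $v0, $zero, 1.
       In MIPS the primary opcode always lives in the top 6 bits,
       in the same place for every instruction. */
    uint32_t insn = 0x24020001;

    uint32_t opcode = insn >> 26;   /* bits 31..26: primary opcode (0x09 = addiu) */
    uint32_t funct  = insn & 0x3f;  /* bits 5..0: secondary field, only consulted
                                       when the primary opcode is 0 */

    printf("opcode field: 0x%02x, funct field: 0x%02x\n", opcode, funct);
    return 0;
}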

1) The next byte after the prior instruction will be interpreted as an opcode, even if it was not intended to be one (assuming execution continues to that byte and doesn't branch). I don't remember my x86 opcode table offhand, so I can't say whether there are any undefined encodings; if a byte is undefined, it will hit a handler (an invalid-opcode exception), otherwise the CPU will decode whatever it finds as machine code. If those bytes are not properly formed instructions, the program will likely crash; sometimes you get lucky and it just messes something up and keeps going, or you get even luckier and can't tell that it almost crashed.

2) You are right: these 8/16-bit CISC and similar instruction sets are treated more like strings that you parse through linearly, as the sketch below shows.
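Putting both answers together, here is a toy C decoder (a sketch covering only three opcodes, nothing like a full x86 decoder with prefixes, ModRM, SIB, and so on) that parses a byte buffer linearly, using each opcode to decide how long the instruction is and where the next one starts:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Instruction length for a tiny subset of one-byte x86 opcodes. Real
   hardware effectively does this with tables and further fields. */
static size_t insn_length(uint8_t opcode) {
    if (opcode >= 0xb8 && opcode <= 0xbf) return 5; /* mov r32, imm32 */
    if (opcode == 0x90) return 1;                   /* nop */
    if (opcode == 0xc3) return 1;                   /* ret */
    return 0;                                       /* unknown: stop */
}

int main(void) {
    /* mov eax,0x11223344 ; nop ; ret */
    uint8_t code[] = {0xb8, 0x44, 0x33, 0x22, 0x11, 0x90, 0xc3};
    size_t pc = 0;

    while (pc < sizeof code) {
        size_t len = insn_length(code[pc]);
        if (len == 0) { printf("unknown opcode 0x%02x\n", code[pc]); break; }
        printf("offset %zu: opcode 0x%02x, length %zu\n", pc, code[pc], len);
        pc += len;  /* the very next byte is treated as a new opcode */
    }
    return 0;
}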
