
PE file opcodes

I'm just in the process of writing a PE file parser and I've reached the point where I'd like to parse and interpret the actual code within PE files, which I'm assuming are stored as x86 opcodes.

As an example, each of the exports within a DLL points to an RVA (Relative Virtual Address) of where the function will be located in memory, and I've written a function to convert these RVAs to physical file offsets.
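For readers following along, here is a minimal sketch of that kind of RVA-to-file-offset conversion. It is not the asker's code: the `Section` tuple, the `SECTIONS` list and the `.text` layout values are hypothetical, chosen only so that an RVA of 0x305D lands on the file offset 0x245D mentioned below. The field meanings follow the PE section header (VirtualAddress, VirtualSize, PointerToRawData, SizeOfRawData).

```python
from typing import NamedTuple, Optional, Sequence


class Section(NamedTuple):
    name: str
    virtual_address: int  # RVA at which the section is mapped
    virtual_size: int     # size of the section in memory
    raw_pointer: int      # PointerToRawData: offset of the section in the file
    raw_size: int         # SizeOfRawData: bytes actually stored in the file


def rva_to_offset(rva: int, sections: Sequence[Section]) -> Optional[int]:
    """Map an RVA to a file offset, or None if no file bytes back it."""
    for s in sections:
        delta = rva - s.virtual_address
        # The RVA belongs to this section if it falls inside its mapped range.
        if 0 <= delta < max(s.virtual_size, s.raw_size):
            # Bytes past raw_size are zero-filled by the loader, not in the file.
            return s.raw_pointer + delta if delta < s.raw_size else None
    return None


# Hypothetical layout, for illustration only (not read from a real DLL).
SECTIONS = [Section(".text", 0x1000, 0x3000, 0x400, 0x3000)]

print(hex(rva_to_offset(0x305D, SECTIONS)))
```

An RVA that lies in no section, or in the zero-filled tail of a section, has no file offset at all, which is why the sketch returns None instead of guessing.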

The question is, are these really opcodes, or are they something else?

Does it depend on the compiler/linker as to how the functions are stored within the file, or are they one- or two-byte x86 opcodes?

As an example, the Windows 7 DLL 'BWContextHandler.dll' contains four functions that are loaded into memory, making them available within the system. The first exported function is 'DllCanUnloadNow', and it is located at offset 0x245D within the file. The first four bytes of this data are: 0xA1 0x5C 0xF1 0xF2

So are these one or two byte opcodes, or are they something else entirely?

If anyone can provide any information on how to examine these, it would be appreciated.

Thanks!

After a bit of further reading, and running the file through the demo version of IDA, I think I'm correct in saying that the first byte 0xA1, is a one byte opcode, meaning mov eax. I got that from here: http://ref.x86asm.net/geek32.html#xA1 and I'm assuming it is correct for the time being.

However, I'm a bit confused as to how the bytes following comprise the rest of the instruction. From the x86 assembler that I know, a move instruction requires two parameters, the destination and the source, so the instruction is to move (something) into the eax register, and I'm assuming that the something comes in the following bytes. However I don't know how to read that information yet :)

x86 uses a complex, variable-length, multi-byte encoding, so you can't decode an instruction by looking up a single row in an opcode table the way you can on a RISC architecture (MIPS/SPARC/DLX). One instruction can be up to 15 bytes long: several prefixes (including the multi-byte VEX prefix) + a 1-3 byte opcode + fields that encode the operands, such as the ModR/M and SIB bytes, a displacement or memory offset (moffs), and an immediate (imm). A single mnemonic can map to dozens of opcodes. On top of that, the same assembly line can sometimes be encoded in two different ways (in 32-bit mode, "inc eax" is both 0x40 and 0xFF 0xC0).
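To make the "inc eax" example concrete, here is a small sketch of how the ModR/M byte splits into its mod, reg and rm fields. The function names are my own, and it only handles the register forms of INC in 32-bit mode (in 64-bit mode the bytes 0x40-0x47 are REX prefixes instead).

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) bit fields: 2 + 3 + 3 bits."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def decode_inc(code):
    """Decode the two register encodings of INC r32 (32-bit mode only)."""
    op = code[0]
    if 0x40 <= op <= 0x47:
        # Short form: the register number is embedded in the opcode byte.
        return "inc " + REG32[op - 0x40]
    if op == 0xFF and len(code) >= 2:
        mod, reg, rm = split_modrm(code[1])
        # FF /0 is INC r/m32; mod == 3 means rm names a register directly.
        if mod == 3 and reg == 0:
            return "inc " + REG32[rm]
    raise ValueError("not a register-form INC handled by this sketch")


print(decode_inc(bytes([0x40])), "|", decode_inc(bytes([0xFF, 0xC0])))
```

Both calls decode to the same instruction, which is why a byte-level comparison of two binaries can differ even when the assembly source is identical.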

one byte opcode, meaning mov eax. I got that from here: http://ref.x86asm.net/geek32.html#xA1 and I'm assuming it is correct for the time being.

Let's take a look at the table:

po ; flds ; mnemonic ; op1 ; op2 ; grp1 ; grp2 ; Description

A1 ; W ; MOV ; eAX ; Ov ; gen ; datamov ; Move ;

(HINT: don't use the geek32 table, switch to http://ref.x86asm.net/coder32.html#xA1 - it has fewer fields and spells out more of the decoding, e.g. "A1 MOV eAX moffs16/32 Move")

The columns op1 and op2 (http://ref.x86asm.net/#column_op) describe the operands. For the A1 opcode the first operand is always eAX, and the second (op2) is Ov. According to the table at http://ref.x86asm.net/#Instruction-Operand-Codes :

O / moffs Original The instruction has no ModR/M byte; the offset of the operand is coded as a word, double word or quad word (depending on address size attribute) in the instruction. No base register, index register, or scaling factor can be applied (only MOV (A0, A1, A2, A3)).

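So the A1 opcode is followed by a memory offset. In 32-bit mode, without an address-size prefix, that offset is 32 bits wide and stored little-endian, so the whole instruction is 5 bytes long and the source operand is the dword at the address formed by the four bytes after A1.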
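As a quick check of that reading, the following sketch decodes just this one form (A1 followed by a 32-bit little-endian offset). The function name is made up for illustration, and it assumes 32-bit mode with no prefixes; the fifth byte, 0x05, comes from the longer byte sequence quoted further down.

```python
import struct


def decode_mov_eax_moffs32(code):
    """Decode 'A1 moffs32' and return (text, instruction length in bytes)."""
    if len(code) < 5 or code[0] != 0xA1:
        raise ValueError("expected opcode A1 followed by a 4-byte offset")
    # "<I" = little-endian unsigned 32-bit integer, read right after the opcode.
    (address,) = struct.unpack_from("<I", code, 1)
    return "mov eax, dword ptr [0x%08X]" % address, 5


text, length = decode_mov_eax_moffs32(bytes.fromhex("A15CF1F205"))
print(text, length)
```

Note that the destination register is implied by the opcode itself, which is why no byte in the instruction names eax.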

PS: If your task is to parse PE files and not to write your own disassembler, use an existing x86 disassembly library such as libdisasm or libudis86.

PPS: For original question:

The question is, are these really opcodes, or are they something else?

Yes, "A1 5C F1 F2 05 B9 5C F1 F2 05 FF 50 0C F7 D8 1B C0 F7 D8 C3 CC CC CC CC CC" is x86 machine code.
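To show where the instruction boundaries fall in that byte string, here is a deliberately tiny length decoder. It is a sketch that only knows the handful of opcode forms appearing in this sample (32-bit mode, no prefixes, no SIB byte), not a general x86 decoder; `insn_length` and `split_instructions` are hypothetical names.

```python
SAMPLE = bytes.fromhex("A15CF1F205B95CF1F205FF500CF7D81BC0F7D8C3CCCCCCCCCC")


def insn_length(code, off):
    """Length of the instruction at off, for the few forms used in SAMPLE."""
    op = code[off]
    if op in (0xC3, 0xCC):                  # ret, int3
        return 1
    if op == 0xA1 or 0xB8 <= op <= 0xBF:    # mov eax,[moffs32] / mov r32,imm32
        return 5
    if op in (0x1B, 0xF7, 0xFF):            # sbb r32,r/m32 / group 3 / group 5
        modrm = code[off + 1]
        mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
        if op == 0xF7 and reg < 2:
            raise ValueError("F7 /0 and /1 carry an immediate; not handled")
        if mod == 3:                        # register operand: opcode + ModR/M
            return 2
        if mod == 1 and rm != 4:            # [reg + disp8], no SIB byte
            return 3
    raise ValueError("opcode form not covered by this sketch")


def split_instructions(code, start=0):
    """Walk the bytes linearly and return each instruction as a hex string."""
    out, off = [], start
    while off < len(code):
        n = insn_length(code, off)
        out.append(code[off:off + n].hex().upper())
        off += n
    return out


print(split_instructions(SAMPLE))
```

Read by hand, the groups are: mov eax, [0x05F2F15C]; mov ecx, 0x05F2F15C; call dword ptr [eax+0x0C]; neg eax; sbb eax, eax; neg eax; ret; followed by int3 padding bytes.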

Disassembly is difficult, particularly for code generated by the Visual Studio compiler, and particularly for x86 programs. There are several issues:

  1. Instructions are variable length and can start at any offset. Some architectures require instruction alignment; x86 does not. If you start reading at offset 0, you will get different results than if you start reading at offset 1. You have to know what the valid "starting locations" (function entry points) are.

  2. Not all addresses in the text section of an executable are code. Some are data. Visual Studio will place "jump tables" (arrays used to implement switch statements) in the text section underneath the procedure that reads them. Misinterpreting data as code will lead you to produce incorrect disassembly.

  3. You can't have perfect disassembly that works for all possible programs. Programs can modify themselves. In those cases you have to run the program to know what it does, and that runs into the "halting problem". The best you can hope for is disassembly that works on "most" programs.

The algorithm typically used to address these issues is called "recursive descent" disassembly. It works similarly to a recursive descent parser, in that it starts from known entry points (the entry point of an exe, or all the exports of a DLL) and disassembles from there. Other entry points are discovered during disassembly. For example, the target of a "call" instruction is assumed to be an entry point. The disassembler keeps disassembling newly discovered entry points until no more are found.
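A rough sketch of that worklist loop is below. The `decode` function is a toy stand-in that understands only a few control-flow opcodes (ret, nop, call/jmp rel32, jmp rel8, jcc rel8) so the example stays self-contained; a real tool would plug in a full decoder. All names and the sample bytes are made up for illustration.

```python
import struct


def decode(code, off):
    """Toy decoder: return (length, kind, branch target or None)."""
    op = code[off]
    if op == 0xC3:
        return 1, "ret", None
    if op == 0x90:
        return 1, "other", None
    if op in (0xE8, 0xE9):                      # call rel32 / jmp rel32
        (rel,) = struct.unpack_from("<i", code, off + 1)
        return 5, ("call" if op == 0xE8 else "jmp"), off + 5 + rel
    if op == 0xEB or 0x70 <= op <= 0x7F:        # jmp rel8 / jcc rel8
        (rel,) = struct.unpack_from("<b", code, off + 1)
        return 2, ("jmp" if op == 0xEB else "jcc"), off + 2 + rel
    raise ValueError("opcode not covered by this toy decoder")


def recursive_descent(code, entry_points):
    """Return the sorted offsets of every instruction reachable from entries."""
    visited = {}                     # instruction offset -> length
    worklist = list(entry_points)
    while worklist:
        off = worklist.pop()
        while 0 <= off < len(code) and off not in visited:
            try:
                length, kind, target = decode(code, off)
            except (ValueError, struct.error):
                break                # undecodable bytes: stop following this path
            visited[off] = length
            if target is not None:
                worklist.append(target)   # newly discovered entry point
            if kind in ("ret", "jmp"):
                break                # control never falls through these
            off += length
    return sorted(visited)


# call +2 ; ret ; one data byte (0xFF) ; nop ; ret
CODE = bytes.fromhex("E802000000" "C3" "FF" "90" "C3")
print(recursive_descent(CODE, [0]))
```

The data byte at offset 6 is never decoded because no control flow reaches it, which is the property that separates this approach from a plain linear sweep.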

That technique, however, has some problems. It won't find code that is only ever executed through indirection. On Windows, a good example is handlers for SEH exceptions. The code that dispatches to them is actually inside the operating system, so recursive descent disassembly will not find them and won't disassemble them. However, they can often be detected by augmenting recursive descent with pattern recognition (heuristic matching).

Machine learning can be used to identify such patterns automatically, but many disassemblers (like IDA Pro) use hand-written patterns with a good deal of success.

In any case, if you want to disassemble x86 code, you need to read the Intel manual. There are a lot of scenarios that need to be supported. The same bit patterns in an instruction can be interpreted in different ways depending on modifiers, prefixes, the implicit state of the processor, etc. That's all covered in the manual. Start by reading through the first few sections of Volume 1, which walk through the basic execution environment. Most of the rest of what you need is in Volume 2.
