
PE file opcodes

Original post: 2012-12-07 13:22:39. Tags: windows, parsing, assembly, x86, portable-executable

I'm in the process of writing a PE file parser, and I've reached the point where I'd like to parse and interpret the actual code within PE files, which I'm assuming is stored as x86 opcodes.

As an example, each of the exports within a DLL points to an RVA (Relative Virtual Address) giving where the function will be located in memory, and I've written a function to convert these RVAs to physical file offsets.
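For readers following along, the RVA-to-file-offset conversion mentioned above might look roughly like the sketch below. It is not the asker's code: the `Section` tuple and `rva_to_offset` are hypothetical names, and the section values are invented for illustration (chosen only so that the result lands on the 0x245D offset quoted in the question). The fields mirror the VirtualAddress, PointerToRawData and SizeOfRawData entries of a PE section header.

```python
from typing import NamedTuple, Optional, Sequence


class Section(NamedTuple):
    """Hypothetical, trimmed-down view of one PE section header."""
    name: str
    virtual_address: int  # RVA at which the section starts in memory
    raw_pointer: int      # file offset of the section data (PointerToRawData)
    raw_size: int         # number of bytes stored on disk (SizeOfRawData)


def rva_to_offset(rva: int, sections: Sequence[Section]) -> Optional[int]:
    """Map an RVA to a file offset, or None if no section stores it on disk."""
    for s in sections:
        delta = rva - s.virtual_address
        # Only bytes that are actually present in the file can be mapped.
        if 0 <= delta < s.raw_size:
            return s.raw_pointer + delta
    return None


# Invented section table, for illustration only.
sections = [
    Section(".text", virtual_address=0x1000, raw_pointer=0x400, raw_size=0x3000),
    Section(".data", virtual_address=0x4000, raw_pointer=0x3400, raw_size=0x200),
]

print(hex(rva_to_offset(0x305D, sections)))  # 0x245d
print(rva_to_offset(0x9000, sections))       # None
```

An RVA that falls in the zero-filled tail of a section (beyond the bytes stored on disk) has no file offset, which is why the sketch returns None instead of guessing.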

The question is, are these really opcodes, or are they something else?

Does the way functions are stored within the file depend on the compiler/linker, or are they always one- or two-byte x86 opcodes?

As an example, the Windows 7 DLL 'BWContextHandler.dll' contains four functions that are loaded into memory, making them available within the system. The first exported function is 'DllCanUnloadNow', and it is located at offset 0x245D within the file. The first four bytes of this data are: 0xA1 0x5C 0xF1 0xF2

So are these one- or two-byte opcodes, or are they something else entirely?

If anyone can provide any information on how to examine these, it would be appreciated.

Thanks!

After a bit of further reading, and running the file through the demo version of IDA, I think I'm correct in saying that the first byte, 0xA1, is a one-byte opcode meaning mov eax. I got that from here: http://ref.x86asm.net/geek32.html#xA1 and I'm assuming it is correct for the time being.

However, I'm a bit confused as to how the following bytes make up the rest of the instruction. From the x86 assembler that I know, a move instruction requires two parameters, the destination and the source, so the instruction is to move (something) into the eax register, and I'm assuming that the something comes in the following bytes. However, I don't know how to read that information yet :)

2 answers

x86 uses a complex multi-byte encoding, and you can't simply look up a single line in an instruction table to decode it, as you can with RISC (MIPS/SPARC/DLX). A single instruction can be up to 15 bytes long: a 1-3 byte opcode, plus several prefixes (including the multi-byte VEX), plus several fields that encode an immediate or a memory address, offset and scaling (imm, ModR/M and SIB; moffs). A single mnemonic sometimes has tens of opcodes. What's more, in several cases the same asm line has two possible encodings ("inc eax" = 0x40 and = 0xff 0xc0).

"one byte opcode, meaning mov eax. I got that from here: http://ref.x86asm.net/geek32.html#xA1 and I'm assuming it is correct for the time being."

Let's take a look at the table:

po ; flds ; mnemonic ; op1 ; op2 ; grp1 ; grp2 ; description
A1 ; W ; MOV ; eAX ; Ov ; gen ; datamov ; Move

(HINT: don't use the geek32 table; switch to http://ref.x86asm.net/coder32.html#xA1, which has fewer fields and more of the decoding done for you, e.g. "A1 MOV eAX moffs16/32 Move")

The columns op1 and op2 (http://ref.x86asm.net/#column_op) describe the operands. For the A1 opcode the first operand is always eAX, and the second (op2) is Ov. According to the table at http://ref.x86asm.net/#Instruction-Operand-Codes :

O / moffs (Original): The instruction has no ModR/M byte; the offset of the operand is coded as a word, double word or quad word (depending on address size attribute) in the instruction. No base register, index register, or scaling factor can be applied (only MOV (A0, A1, A2, A3)).

So the memory offset is encoded directly after the A1 opcode. I think it is a 32-bit offset on x86 in 32-bit mode (assuming the default address size, with no address-size prefix).
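As a rough illustration of that reading, here is a minimal Python sketch that decodes only the A1 form, assuming 32-bit mode and no prefixes. The function name `decode_mov_eax_moffs32` is made up for this example; the input bytes are the start of the sequence quoted from the DLL.

```python
import struct


def decode_mov_eax_moffs32(code: bytes, offset: int = 0):
    """Decode opcode A1 (MOV eAX, moffs), assuming 32-bit mode and no prefixes.

    Returns the disassembly text and the instruction length in bytes.
    """
    if code[offset] != 0xA1:
        raise ValueError("not an A1 (MOV eAX, moffs) instruction")
    # The moffs32 operand follows the opcode byte and is stored little-endian.
    (address,) = struct.unpack_from("<I", code, offset + 1)
    length = 1 + 4  # one opcode byte plus a four-byte memory offset
    return f"mov eax, dword ptr [0x{address:08X}]", length


text, length = decode_mov_eax_moffs32(bytes.fromhex("A15CF1F205B95CF1F205"))
print(text, length)  # mov eax, dword ptr [0x05F2F15C] 5
```

So the "something" moved into eax is the dword stored at the absolute address 0x05F2F15C, and the next instruction starts five bytes later, at the B9 byte.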

PS: If your task is to parse PE files and not to invent a disassembler, use an x86 disassembling library such as libdisasm or libudis86, or anything else.

PPS: As for the original question:

"The question is, are these really opcodes, or are they something else?"

Yes, "A1 5C F1 F2 05 B9 5C F1 F2 05 FF 50 0C F7 D8 1B C0 F7 D8 C3 CC CC CC CC CC" is x86 machine code.

Disassembly is difficult, particularly for code generated by the Visual Studio compiler, and particularly for x86 programs. There are several issues:

Instructions are variable length, and can start at any offset. Some architectures require instruction alignment. Not x86. If you start reading at address 0, you will get different results than if you start reading at offset 1. You have to know what the valid "starting locations" (function entry points) are.
Not all addresses in the text section of an executable are code. Some are data. Visual Studio will place "jump tables" (arrays used to implement switch statements) in the text section underneath the procedure that reads them. Misinterpreting data as code will lead you to produce incorrect disassembly.
You can't have perfect disassembly that will work with all possible programs. Programs can modify themselves. In those cases you have to run the program to know what it does, and that ends up leading to the "halting problem". The best you can hope for is disassembly that works on "most" programs.

The algorithm typically used to try to address these issues is called "recursive descent" disassembly. It works similarly to a recursive descent parser, in that it starts with a known "entry point" (either the "main" method of an exe, or all the exports of a dll) and then starts disassembling. Other entry points are discovered during disassembly. For example, given a "call" instruction, the target will be assumed to be an entry point. The disassembler will iteratively disassemble discovered entry points until no more are found.
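To make the idea concrete, here is a small Python sketch of that traversal. It is only an illustration under heavy assumptions: the `decode` helper is a toy that understands a handful of one-byte and relative-branch opcodes (ret, nop, int3, call/jmp rel32, jmp/jcc rel8) in 32-bit mode, whereas a real tool would use a full x86 decoder. The names `decode` and `recursive_descent` and the sample bytes are invented for this example.

```python
import struct


def decode(code: bytes, pos: int):
    """Toy decoder: returns (length, branch_targets, falls_through) or None."""
    op = code[pos]
    if op == 0xC3:                          # ret: the path ends here
        return 1, [], False
    if op in (0x90, 0xCC):                  # nop / int3
        return 1, [], True
    if op in (0xE8, 0xE9):                  # call rel32 / jmp rel32
        if pos + 5 > len(code):
            return None
        (rel,) = struct.unpack_from("<i", code, pos + 1)
        return 5, [pos + 5 + rel], op == 0xE8
    if op == 0xEB or 0x70 <= op <= 0x7F:    # jmp rel8 / jcc rel8
        if pos + 2 > len(code):
            return None
        (rel,) = struct.unpack_from("<b", code, pos + 1)
        return 2, [pos + 2 + rel], op != 0xEB
    return None                             # opcode outside the toy subset


def recursive_descent(code: bytes, entry_points):
    """Return the sorted offsets of every instruction reachable from the entries."""
    visited = {}                            # offset -> instruction length
    worklist = list(entry_points)
    while worklist:
        pos = worklist.pop()
        # Follow straight-line code until it ends or joins something seen before.
        while 0 <= pos < len(code) and pos not in visited:
            decoded = decode(code, pos)
            if decoded is None:
                break
            length, targets, falls_through = decoded
            visited[pos] = length
            worklist.extend(targets)        # branch targets become new entry points
            if not falls_through:
                break
            pos += length
    return sorted(visited)


# call +3 ; ret ; two data bytes ; nop ; jz +1 ; nop ; ret
code = bytes.fromhex("E803000000" "C3" "FFFF" "90" "7401" "90" "C3")
print(recursive_descent(code, [0]))  # [0, 5, 8, 9, 11, 12]
```

Note that offsets 6 and 7 (the two data bytes after the first ret) are never decoded, because no control-flow path reaches them. A linear sweep from offset 0 would have tried to interpret them as code.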

That technique, however, has some problems. It won't find code that is only ever executed through indirection. On Windows, a good example is handlers for SEH exceptions. The code that dispatches to them is actually inside the operating system, so recursive descent disassembly will not find them, and won't disassemble them. However, they can often be detected by augmenting recursive descent with pattern recognition (heuristic matching).

Machine learning can be used to automatically identify patterns, but many disassemblers (like IDA Pro) use hand-written patterns with a good deal of success.

In any case, if you want to disassemble x86 code, you need to read the Intel Manual. There are a lot of scenarios that need to be supported. The same bit patterns in an instruction can be interpreted in various different ways depending on modifiers, prefixes, the implicit state of the processor, etc. That's all covered in the manual. Start by reading through the first few sections of Volume I. That will walk through the basic execution environment. Most of the rest of the stuff you need is in Volume II.