Reverse engineering C-source code from assembly
I would like to know if anyone can help me with a problem I ran into while studying one of the lecture slides from an introductory assembly class I am taking at school. My problem is not understanding the assembly itself; it is understanding how the C source code is ordered relative to the assembly. I will post the snippet in question, which should make this clearer.
C source given:
int arith(int x, int y, int z)
{
    int t1 = x+y;
    int t2 = z+t1;
    int t3 = x+4;
    int t4 = y * 48;
    int t5 = t3 + t4;
    int rval = t2 * t5;
    return rval;
}
Assembly given:
arith:
pushl %ebp
movl %esp,%ebp
movl 8(%ebp),%eax
movl 12(%ebp),%edx
leal (%edx,%eax),%ecx
leal (%edx,%edx,2),%edx
sall $4,%edx
addl 16(%ebp),%ecx
leal 4(%edx,%eax),%eax
imull %ecx,%eax
movl %ebp,%esp
popl %ebp
ret
I am just confused about how I am supposed to tell, for example, that the addition z + t1 (z + x + y) is on the second line of the source when, in the assembly, it comes after the y * 48, or that x + 4 is the third line when, in the assembly, it is not even on a line by itself but is mixed in with the last leal statement. It makes sense to me when I have the source, but I am supposed to be able to reproduce the source for a test. I understand that the compiler optimizes things, but if anyone has a way of thinking about the reverse engineering and could walk me through their thought process, I would greatly appreciate it.
Thanks.
I've broken down the disassembly for you to show how the assembly was produced from the C source.
8(%ebp) = x, 12(%ebp) = y, 16(%ebp) = z
arith:
Create the stack frame:
pushl %ebp
movl %esp,%ebp
Move x into eax, y into edx:
movl 8(%ebp),%eax
movl 12(%ebp),%edx
t1 = x + y. leal (load effective address) adds edx and eax, and t1 ends up in ecx:
leal (%edx,%eax),%ecx
int t4 = y * 48; is done in the two steps below: multiply by 3, then by 16. t4 will eventually be in edx.
Multiply edx by 2 and add edx to the result, i.e. edx = edx * 3:
leal (%edx,%edx,2),%edx
Shift left 4 bits, i.e. multiply by 16:
sall $4,%edx
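The two-step multiplication can be checked with a small C sketch. The names mul48_lea_shift and check_mul48 are made up for illustration, and unsigned arithmetic is an assumption chosen here so that the shift is well defined for every input; the question's code uses int.

```c
#include <assert.h>

/* Hypothetical helper: y * 48 computed the way the two instructions do it.
   Unsigned arithmetic is used so the left shift is always well defined. */
unsigned mul48_lea_shift(unsigned y)
{
    unsigned edx = y;
    edx = edx + edx * 2;  /* leal (%edx,%edx,2),%edx : edx = y * 3       */
    edx = edx << 4;       /* sall $4,%edx            : edx = y * 3 * 16  */
    return edx;
}

/* Compare the two-step version against a plain multiplication. */
void check_mul48(void)
{
    for (unsigned y = 0; y < 1000; ++y)
        assert(mul48_lea_shift(y) == y * 48u);
}
```

Calling check_mul48() from a main function should complete without tripping an assertion, since 48 = 3 * 16.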
int t2 = z+t1;. ecx initially holds t1 and z is at 16(%ebp); at the end of the instruction ecx holds t2:
addl 16(%ebp),%ecx
int t5 = t3 + t4;. t3 was simply x + 4, and rather than calculating and storing t3, the expression for t3 is placed inline. This instruction essentially does (x+4) + t4, which is the same as t3 + t4. It adds edx (t4) and eax (x), and adds 4 as an offset to achieve that result:
leal 4(%edx,%eax),%eax
int rval = t2 * t5;. This one is fairly straightforward: ecx holds t2 and eax holds t5. The return value is passed back to the caller through eax:
imull %ecx,%eax
Restore esp and ebp:
movl %ebp,%esp
popl %ebp
ret
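One way to practice this is to transliterate the assembly into C one instruction at a time, naming the variables after the registers, and only then tidy the result up. The sketch below does that for the listing above; arith_from_asm is a name made up for illustration, and the shift is written as a multiplication by 16 so the C stays well defined for negative y.

```c
#include <assert.h>

/* The function as given in the question. */
int arith(int x, int y, int z)
{
    int t1 = x + y;
    int t2 = z + t1;
    int t3 = x + 4;
    int t4 = y * 48;
    int t5 = t3 + t4;
    int rval = t2 * t5;
    return rval;
}

/* Register-by-register transliteration of the assembly (illustrative). */
int arith_from_asm(int x, int y, int z)
{
    int eax = x;          /* movl 8(%ebp),%eax                       */
    int edx = y;          /* movl 12(%ebp),%edx                      */
    int ecx = edx + eax;  /* leal (%edx,%eax),%ecx     ecx = t1      */
    edx = edx + edx * 2;  /* leal (%edx,%edx,2),%edx   edx = y * 3   */
    edx = edx * 16;       /* sall $4,%edx              edx = t4      */
    ecx = ecx + z;        /* addl 16(%ebp),%ecx        ecx = t2      */
    eax = 4 + edx + eax;  /* leal 4(%edx,%eax),%eax    eax = t5      */
    eax = eax * ecx;      /* imull %ecx,%eax           eax = rval    */
    return eax;           /* result is returned in eax               */
}
```

Substituting the register variables back into each other gives an expression equivalent to the original function, even though the statement order differs.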
As others have said, you can't go exactly back to the source from the disassembly. It's up to the person reading the assembly to interpret it and come up with equivalent C code.
Compile with debug information (-g), which records which source line each instruction came from:
gcc -c -g arith.c
If you're on a 64-bit machine, you can tell the compiler to create a 32-bit binary with the -m32 flag (I did so for the example below).
Use objdump to dump the object file with its source interleaved:
objdump -d -S arith.o
-d = disassembly, -S = display source. You can add -M intel-mnemonic to use the Intel ASM syntax if you prefer that over the AT&T syntax that your example uses.
Output:
arith.o: file format elf32-i386
Disassembly of section .text:
00000000 <arith>:
int arith(int x, int y, int z)
{
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 ec 20 sub $0x20,%esp
int t1 = x+y;
6: 8b 45 0c mov 0xc(%ebp),%eax
9: 8b 55 08 mov 0x8(%ebp),%edx
c: 01 d0 add %edx,%eax
e: 89 45 fc mov %eax,-0x4(%ebp)
int t2 = z+t1;
11: 8b 45 fc mov -0x4(%ebp),%eax
14: 8b 55 10 mov 0x10(%ebp),%edx
17: 01 d0 add %edx,%eax
19: 89 45 f8 mov %eax,-0x8(%ebp)
int t3 = x+4;
1c: 8b 45 08 mov 0x8(%ebp),%eax
1f: 83 c0 04 add $0x4,%eax
22: 89 45 f4 mov %eax,-0xc(%ebp)
int t4 = y * 48;
25: 8b 55 0c mov 0xc(%ebp),%edx
28: 89 d0 mov %edx,%eax
2a: 01 c0 add %eax,%eax
2c: 01 d0 add %edx,%eax
2e: c1 e0 04 shl $0x4,%eax
31: 89 45 f0 mov %eax,-0x10(%ebp)
int t5 = t3 + t4;
34: 8b 45 f0 mov -0x10(%ebp),%eax
37: 8b 55 f4 mov -0xc(%ebp),%edx
3a: 01 d0 add %edx,%eax
3c: 89 45 ec mov %eax,-0x14(%ebp)
int rval = t2 * t5;
3f: 8b 45 f8 mov -0x8(%ebp),%eax
42: 0f af 45 ec imul -0x14(%ebp),%eax
46: 89 45 e8 mov %eax,-0x18(%ebp)
return rval;
49: 8b 45 e8 mov -0x18(%ebp),%eax
}
4c: c9 leave
4d: c3 ret
As you can see, without optimizations the compiler produces a larger binary than the example you have. You can play around with that and add a compiler optimization flag when compiling (i.e. -O1, -O2, -O3). The higher the optimization level, the more abstract the disassembly is going to seem.
For example, with just level 1 optimization (gcc -c -g -O1 -m32 arith.c), the assembly code produced is a lot shorter.
You can't reproduce the original source; you can only reproduce an equivalent source.
In your case the calculation for t2 can appear anywhere after t1 and before rval.
The source might even have been a single expression:
return (x+y+z) * ((x+4) + (y * 48));
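To see why the statement order cannot be recovered, here is a small sketch comparing three candidate sources: the original, a version with t2 moved down to just before rval, and the single expression. The names arith_original, arith_reordered, arith_single and check_equivalent are made up for illustration, and the test range is kept small so that no intermediate value overflows.

```c
#include <assert.h>

/* Statement order as in the question. */
int arith_original(int x, int y, int z)
{
    int t1 = x + y;
    int t2 = z + t1;
    int t3 = x + 4;
    int t4 = y * 48;
    int t5 = t3 + t4;
    return t2 * t5;
}

/* Same computation with t2 moved to just before the final multiply. */
int arith_reordered(int x, int y, int z)
{
    int t1 = x + y;
    int t3 = x + 4;
    int t4 = y * 48;
    int t5 = t3 + t4;
    int t2 = z + t1;
    return t2 * t5;
}

/* Same computation written as one expression. */
int arith_single(int x, int y, int z)
{
    return (x + y + z) * ((x + 4) + (y * 48));
}

/* All three variants should agree on every input in a small range. */
void check_equivalent(void)
{
    for (int x = -5; x <= 5; ++x)
        for (int y = -5; y <= 5; ++y)
            for (int z = -5; z <= 5; ++z) {
                int expected = arith_original(x, y, z);
                assert(arith_reordered(x, y, z) == expected);
                assert(arith_single(x, y, z) == expected);
            }
}
```

Because all three compute the same value, an optimizing compiler is free to emit the same instructions for any of them, so the assembly alone does not say which one the programmer wrote.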
When reverse engineering, you don't care about the original source code line by line; you care about what it does. A side effect is that you see what the code does, not what the programmer intended the code to do.
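Decompilation is not fully achievable: some knowledge is lost when going from source code (where comments and names give clues to the original programmer's intent) to binary machine code (where the instructions are simply what the processor will execute).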