
Do programming language compilers first translate to assembly or directly to machine code?

I'm primarily interested in popular and widely used compilers, such as gcc. But if things are done differently with different compilers, I'd like to know that, too.

Taking gcc as an example, does it compile a short program written in C directly to machine code, or does it first translate it to human-readable assembly, and only then use an (in-built?) assembler to translate the assembly program into binary machine code, i.e. a series of instructions to the CPU?

Is using assembly code to create a binary executable a significantly expensive operation? Or is it a relatively simple and quick thing to do?

(Let's assume we're dealing with only the x86 family of processors, and all programs are written for Linux.)

gcc actually produces assembly code and assembles it using the as assembler. Not all compilers do this: the MS compilers produce object code directly, though you can make them generate assembler output. Translating assembly to object code is a pretty simple process, at least compared with compilation.
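To give a feel for why assembling is simple compared with compiling, here is a minimal sketch of a table-driven assembler for a three-instruction subset of 32-bit x86. It is an illustration only, not how as works internally: the function names are made up for this example, and it reads Intel-style operand order, whereas as defaults to AT&T syntax.

```python
import struct

# 32-bit register numbers as they appear in x86 instruction encodings.
REGISTERS = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3}


def assemble_line(line):
    """Translate one line of a tiny x86 subset into machine-code bytes."""
    mnemonic, _, operands = line.strip().partition(" ")
    args = [a.strip() for a in operands.split(",")] if operands.strip() else []
    if mnemonic == "nop" and not args:
        return b"\x90"
    if mnemonic == "ret" and not args:
        return b"\xc3"
    if mnemonic == "mov" and len(args) == 2 and args[0] in REGISTERS:
        # mov r32, imm32 is encoded as (0xB8 + register number) followed by
        # the 32-bit immediate in little-endian byte order.
        opcode = 0xB8 + REGISTERS[args[0]]
        return bytes([opcode]) + struct.pack("<I", int(args[1], 0) & 0xFFFFFFFF)
    raise ValueError(f"unsupported instruction: {line!r}")


def assemble(source):
    """Assemble every non-blank line and concatenate the resulting bytes."""
    return b"".join(assemble_line(l) for l in source.splitlines() if l.strip())


program = """
mov eax, 42
ret
"""
machine_code = assemble(program)
print(machine_code.hex(" "))  # b8 2a 00 00 00 c3
```

Each instruction is essentially a lookup plus some byte packing. A real assembler also handles labels, sections, addressing modes, and relocation records, but there is nothing comparable to a compiler's parsing and optimization work.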

Some compilers produce other high-level language code as their output. For example, cfront, the first C++ compiler, produced C as its output, which was then compiled by a C compiler.

Note that neither direct compilation nor assembly actually produces an executable. That is done by the linker, which takes the various object code files produced by compilation/assembly, resolves all the names they contain and produces the final executable binary.
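The name-resolution step described above can be sketched as follows. This is a toy model under loud assumptions: the "object file" layout (a dict with code, defs, and relocs) is invented for this example and is not ELF or any real format, and every reference is patched as a 4-byte little-endian absolute address.

```python
import struct


def link(objects, base_address=0):
    """Concatenate toy object files and patch symbol references.

    Each object is a dict with:
      code   - bytes, with 4 zero bytes wherever an address is needed
      defs   - {symbol name: offset within this object's code}
      relocs - [(offset within this object's code, symbol name)]
    """
    image = bytearray()
    symbols = {}
    bases = []
    # Pass 1: lay the objects out one after another and record where
    # every defined symbol ends up in the final image.
    for obj in objects:
        bases.append(len(image))
        for name, offset in obj["defs"].items():
            if name in symbols:
                raise ValueError(f"duplicate symbol: {name}")
            symbols[name] = base_address + len(image) + offset
        image += obj["code"]
    # Pass 2: fill in each placeholder with the resolved address.
    for obj, base in zip(objects, bases):
        for offset, name in obj["relocs"]:
            if name not in symbols:
                raise ValueError(f"undefined symbol: {name}")
            struct.pack_into("<I", image, base + offset, symbols[name])
    return bytes(image), symbols


main_obj = {"code": b"\xaa" + b"\x00" * 4, "defs": {"main": 0}, "relocs": [(1, "helper")]}
helper_obj = {"code": b"\xbb\xcc", "defs": {"helper": 0}, "relocs": []}

image, symbols = link([main_obj, helper_obj], base_address=0x1000)
print(image.hex(" "))  # aa 05 10 00 00 bb cc
```

The reference to helper inside the first object cannot be filled in until the linker knows where the second object lands, which is why this step happens after compilation and assembly rather than during them.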

Almost all compilers, including gcc, produce assembly code because it's easier, both to produce and when debugging the compiler. The major exceptions are usually just-in-time compilers or interactive compilers, whose authors don't want the performance overhead or the hassle of forking a whole process to run the assembler. Some interesting examples include:

  • Standard ML of New Jersey, which runs interactively and compiles every expression on the fly.

  • The tinycc compiler, which is designed to be fast enough to compile, load, and run a C script in well under 100 milliseconds, and therefore doesn't want the overhead of calling the assembler and linker.

What these cases have in common is a desire for "instantaneous" response. Assemblers and linkers are plenty fast, but not quite good enough for interactive response. Yet.

There is also a large family of languages, such as Smalltalk, Java, and Lua, which compile to bytecode, not assembly code, but whose implementations may later translate that bytecode directly to machine code without benefit of an assembler.
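As an illustration of the bytecode half of that story, CPython (not one of the implementations named above, so treat it as an assumed stand-in) also compiles source to bytecode, and its standard dis module can show the result. The function add is a made-up example; the exact opcode names differ between Python versions, so no expected listing is shown.

```python
import dis


def add(y, z, w):
    return y + z + w


# Each entry is one bytecode instruction for the interpreter's virtual
# machine, not an instruction for the physical CPU.
instructions = list(dis.get_instructions(add))
for ins in instructions:
    print(ins.opname, ins.argrepr)
```

No assembler is involved at any point: the compiler writes the bytecode directly in its binary form, and the listing above is produced afterwards by disassembling it.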

(Footnote: in the early 1990s, Mary Fernandez and I wrote the New Jersey Machine Code Toolkit, for which the code is online, which generates C libraries that compiler writers can use to bypass the standard assembler and linker. Mary used it to roughly double the speed of her optimizing linker when generating a.out. If you don't write to disk, speedups are even greater...)

According to chapter 2 of Introduction to Reverse Engineering Software (by Mike Perry and Nasko Oskov), both gcc and cl.exe (the back end compiler for MSVC++) can output the assembly that each compiler produces: gcc with the -S switch, and cl.exe with its listing option (/FA in Microsoft's documentation).

You can also run gcc in verbose mode (gcc -v) to get a list of the commands that it executes, to see what it's doing behind the scenes.

Compilers, in general, parse the source code into an Abstract Syntax Tree (an AST), then translate it into some intermediate language. Only then, usually after some optimizations, do they emit the target language.

As for gcc, it can compile to a wide variety of targets. I don't know if for x86 it compiles to assembly first, but I did give you some insight into compilers, and you asked for that too.

GCC compiles to assembler. Some other compilers don't. For example, LLVM-GCC compiles to LLVM-assembly or LLVM-bytecode, which is then compiled to machine code. Almost all compilers have some sort of internal representation: LLVM-GCC uses LLVM and, IIRC, GCC uses something called GIMPLE.

None of the answers clarifies the fact that an ASSEMBLER is the first layer of abstraction between BINARY CODE and MACHINE DEPENDENT SYMBOLIC CODE. A compiler is the second layer of abstraction between MACHINE DEPENDENT SYMBOLIC CODE and MACHINE INDEPENDENT SYMBOLIC CODE.

If a compiler directly converts code to binary code, by definition, it will be called an assembler and not a compiler.

It is more appropriate to say that a compiler uses INTERMEDIATE CODE, which may or may not be assembly language. E.g., Java uses bytecode as intermediate code, and bytecode is the assembly language of the Java virtual machine (JVM).

EDIT: You may wonder why an assembler always works with machine dependent code and why a compiler is capable of handling machine independent code. The answer is very simple. Assembly language is a direct mapping of machine code, and therefore it is always machine dependent. By contrast, we can write more than one version of a compiler, one for each machine. So to run our code independently of the machine, we compile the same code, but with the compiler version written for that machine.

There are many phases of compilation. In the abstract, there is the front end, which reads the source code, breaks it up into tokens and finally builds a parse tree.

The back end is responsible for first generating sequential code such as three-address code, e.g. translating the code:

x = y + z + w

into:

reg1 = y + z
x = reg1 + w

It then optimizes this code, translates it into assembly and finally into machine language. All steps are layered carefully so that, when needed, one of them can be replaced.
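The lowering of x = y + z + w into three-address code can be sketched in a few lines. This is an illustration, not how any particular compiler does it: it borrows Python's ast module as a stand-in front end (an assumption that works here because this simple expression parses the same way as in C), and to_three_address is a hypothetical helper name.

```python
import ast

OPERATORS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def to_three_address(statement):
    """Lower one assignment such as 'x = y + z + w' to three-address code."""
    assign = ast.parse(statement).body[0]
    if (not isinstance(assign, ast.Assign) or len(assign.targets) != 1
            or not isinstance(assign.targets[0], ast.Name)):
        raise ValueError("expected a single 'name = expression' assignment")
    target = assign.targets[0].id
    code = []
    temps = 0

    def lower(node, dest=None):
        """Emit code for node; return the name that holds its value."""
        nonlocal temps
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in OPERATORS:
            left = lower(node.left)
            right = lower(node.right)
            if dest is None:
                # Allocate a fresh temporary only after the operands are
                # lowered, so temporaries are numbered in emission order.
                temps += 1
                dest = f"reg{temps}"
            code.append(f"{dest} = {left} {OPERATORS[type(node.op)]} {right}")
            return dest
        raise ValueError("unsupported expression")

    if isinstance(assign.value, ast.BinOp):
        lower(assign.value, dest=target)
    else:
        code.append(f"{target} = {lower(assign.value)}")
    return code


for line in to_three_address("x = y + z + w"):
    print(line)
# reg1 = y + z
# x = reg1 + w
```

Every emitted line has at most one operator and at most three names, which is what makes this form easy to optimize and to map onto machine instructions.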

You might be interested in listening to this podcast: Internals of GCC

In most multi-pass compilers, assembly language is generated during the code generation steps. This allows you to write the lexer, syntax and semantic phases once and then generate executable code using a single assembler back end. This is used a lot in cross compilers, such as C compilers that generate code for a range of different CPUs.

Just about every compiler has some form of this, whether it's an implicit or explicit step.

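Although not all compilers convert the source code to intermediate-level code, several compilers do have a bridge that takes the source code to machine-level code.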

Some of the above answers confused me because in some answers GCC (GNU Compiler Collection) is mentioned as a single tool, but it's a suite of tools like the GNU Assembler (also known as GAS), linker, compiler and debugger, which are used together to produce an executable. And yes, GCC doesn't directly convert the C source file to machine code.

It does that in 4 steps:

  1. Pre-processing - removing comments, expanding macros (of C), etc.
  2. Compilation - source to assembly (done by the compiler)
  3. Assembling - assembly to machine code (done by the assembler)
  4. Linking - by default, linking standard functions dynamically to shared libraries (done by the linker)
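The four steps can be run one at a time with gcc's -E, -S and -c options. The sketch below only builds and prints the command lines, and runs them solely when run_stages is called; hello.c and the helper names are hypothetical, and it assumes gcc is on the PATH.

```python
import shutil
import subprocess


def gcc_stage_commands(source="hello.c"):
    """Return one gcc command line per stage for the given C source file."""
    stem = source.rsplit(".", 1)[0]
    return [
        ["gcc", "-E", source, "-o", f"{stem}.i"],       # 1. pre-process only
        ["gcc", "-S", f"{stem}.i", "-o", f"{stem}.s"],  # 2. compile to assembly
        ["gcc", "-c", f"{stem}.s", "-o", f"{stem}.o"],  # 3. assemble to an object file
        ["gcc", f"{stem}.o", "-o", stem],               # 4. link into an executable
    ]


def run_stages(source="hello.c"):
    """Run the stages in order, stopping at the first failure."""
    if shutil.which("gcc") is None:
        raise RuntimeError("gcc not found on PATH")
    for command in gcc_stage_commands(source):
        subprocess.run(command, check=True)


for command in gcc_stage_commands("hello.c"):
    print(" ".join(command))
```

After running the stages, hello.s holds the human-readable assembly that gcc would otherwise write to a temporary file and delete. Passing -save-temps to a normal gcc invocation keeps the same intermediate files without splitting the build into separate commands.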

A listing file is a compiler-generated text file that contains the assembly language code produced by the compiler. Most compilers support the generation of listing files during the compilation process. For some compilers, such as GCC, this is a standard part of the compilation process because the compiler doesn't directly generate an object file, but instead generates an assembly language file which is then processed by an assembler. In such compilers, requesting a listing file simply means that the compiler must not delete it after the assembler is done with it. In other compilers (such as the Microsoft or Intel compilers), a listing file is an optional feature that must be enabled through the command line.

Visual C++ has a switch to output assembly code, so I think it generates assembly code before outputting machine code.

Java compilers compile to Java bytecode (a binary format), which is then run by a virtual machine (the JVM).

While this may seem slow, it can be faster because the JVM can take advantage of later CPU instructions and new optimizations. A C++ compiler won't do this: you have to target the instruction set at compile time.


 