
Designing a Language Lexer

I'm currently in the process of creating a programming language. I've laid out my entire design and am now creating the lexer for it. I have created numerous lexers and lexer generators in the past, but have never come to adopt the "standard", if one exists.

Is there a specific way a lexer should be created to maximise the ability to use it with as many parsers as possible?

The way I design mine, they look like the following:

Code:

int main() {
    printf("Hello, World!");
}

Lexer:

[
KEYWORD:INT, IDENTIFIER:"main", LEFT_ROUND_BRACKET, RIGHT_ROUND_BRACKET, LEFT_CURLY_BRACKET,
IDENTIFIER:"printf", LEFT_ROUND_BRACKET, STRING:"Hello, World!", RIGHT_ROUND_BRACKET, SEMICOLON,
RIGHT_CURLY_BRACKET
]

Is this the way lexers should be made? Also, as a side note, what should my next step be after creating a lexer? I don't really want to use something such as ANTLR or Lex+Yacc or Flex+Bison, etc. I'm doing it from scratch.

If you don't want to use a parser generator [Note 1], then it is absolutely up to you how your lexer provides information to your parser.

Even if you do use a parser generator, there are many details which are going to be project-dependent. Sometimes it is convenient for the lexer to call the parser with each token; other times it is easier if the parser calls the lexer; in some cases, you'll want to have a driver which interacts separately with each component. And clearly, the precise datatype(s) of your tokens will vary from project to project, which can have an impact on how you communicate as well.

Personally, I would avoid use of global variables (as in the original yacc/lex protocol), but that's a general style issue.

Most lexers work in streaming mode, rather than tokenizing the entire input and then handing the vector of tokens to some higher power. Tokenizing one token at a time has a number of advantages, particularly if the tokenization is context-dependent, and, let's face it, almost all languages have some impurity somewhere in their syntax. But, again, that's entirely up to you.
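To illustrate the streaming style, here is a minimal sketch of a lexer the parser pulls tokens from one at a time. The names (`StreamingLexer`, `Tok`, `next`) and the token categories are illustrative, not from the original post:

```cpp
#include <cctype>
#include <string>

// Illustrative token kinds; a real language would have many more.
enum class TokKind { Identifier, Number, Symbol, EndOfFile };

struct Tok {
    TokKind kind;
    std::string text;
};

// A streaming lexer: it keeps an internal position into the source and
// produces one token per call, instead of building a whole token vector.
class StreamingLexer {
public:
    explicit StreamingLexer(std::string src) : src_(std::move(src)) {}

    // Scan and return the next token; called repeatedly by the parser.
    Tok next() {
        while (pos_ < src_.size() && std::isspace((unsigned char)src_[pos_]))
            ++pos_;
        if (pos_ >= src_.size())
            return {TokKind::EndOfFile, ""};
        char c = src_[pos_];
        if (std::isalpha((unsigned char)c))
            return lexWhile(TokKind::Identifier,
                            [](char ch) { return std::isalnum((unsigned char)ch) != 0; });
        if (std::isdigit((unsigned char)c))
            return lexWhile(TokKind::Number,
                            [](char ch) { return std::isdigit((unsigned char)ch) != 0; });
        ++pos_;
        return {TokKind::Symbol, std::string(1, c)};
    }

private:
    // Consume characters while 'pred' holds and build one token from them.
    template <typename Pred>
    Tok lexWhile(TokKind kind, Pred pred) {
        std::size_t start = pos_;
        while (pos_ < src_.size() && pred(src_[pos_]))
            ++pos_;
        return {kind, src_.substr(start, pos_ - start)};
    }

    std::string src_;
    std::size_t pos_ = 0;
};
```

A parser built on top of this simply loops, calling `next()` until it sees the `EndOfFile` token, so the full token vector never needs to exist in memory.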

Good luck with your project.


Notes:

  1. Do you also forgo the use of compilers and write all your code from scratch in assembler or even binary?

Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?

In the lexers I've looked at, the canonical API is pretty minimal. It's basically:

Token readNextToken();

The lexer maintains a reference to the source text and its internal pointers into where it is currently looking. Then, every time you call that, it scans and returns the next token.

The Token type usually has:

  • A "type" enum for which kind of token it is: string, operator, identifier, etc. There are usually special kinds for "EOF", meaning a special terminator token that is produced after the end of the input, and "ERROR" for the rare cases where a syntax error comes from the lexical grammar. 它是哪种类型的令牌的“类型”枚举:字符串,运算符,标识符等。“ EOF”通常具有特殊的种类,即在输入结束后生成的特殊终止符,而“ ERROR”对于少数情况,其中语法错误来自词汇语法。 This is mainly just unterminated string literals or totally unknown characters in the source. 这主要是源中未终止的字符串文字或完全未知的字符。
  • The source text of the token. 令牌的源文本。
  • Sometimes literals are converted to their proper value representation during lexing in which case you'll have that value too. 有时,在词法分析过程中,文字会转换为其正确的值表示形式,在这种情况下,您也将拥有该值。 So a number token would have "123" as text but also have the numeric value 123. Or you can do that during parsing/compilation. 因此,数字令牌将以“ 123”作为文本,但也具有数字 123。或者您可以在解析/编译期间执行此操作。
  • Location within the source file of the token. 令牌源文件中的位置。 This is for error reporting. 这是用于错误报告。 Usually 1-based line and column, but can also just be start and end byte offsets. 通常基于1的行和列,但也可以只是开始和结束字节偏移量。 The latter is a little faster to produce and can be converted to line and column lazily if needed. 后者的生产速度稍快一些,可以根据需要延迟转换为行和列。
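Put together, those fields give a Token type along these lines. This is one possible layout, not the answerer's exact design; all names here are illustrative:

```cpp
#include <string>

// Which kind of token this is, including the two special kinds the
// answer mentions: EndOfFile (terminator after the input ends) and
// Error (e.g. an unterminated string literal or unknown character).
enum class TokenType {
    Identifier, Keyword, String, Number, Operator,
    EndOfFile,
    Error
};

// Where the token starts, for error reporting (1-based line/column;
// byte offsets would work equally well and are cheaper to produce).
struct SourceLocation {
    int line = 1;
    int column = 1;
};

struct Token {
    TokenType type;          // which kind of token it is
    std::string text;        // exact source text of the token
    double numberValue = 0;  // value converted during lexing (numbers only)
    SourceLocation loc;      // position in the source file
};
```

Whether you store the converted value (`numberValue` here) in the token or convert later during parsing/compilation is a project-level choice, as the answer notes.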

Depending on your grammar, you may need to be able to rewind the lexer too.
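One common way to get rewinding without changing the lexer itself is a buffering wrapper: remember tokens as they are read, let the parser mark a position before trying an alternative, and rewind to the mark if that attempt fails. The sketch below is hypothetical (the class and method names are mine, and plain strings stand in for tokens, with "" playing the role of EOF):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Adds mark/rewind on top of any one-token-at-a-time producer.
class RewindableLexer {
public:
    // 'produce' is the underlying streaming lexer; it must keep
    // returning "" (our stand-in for EOF) once the input is exhausted.
    explicit RewindableLexer(std::function<std::string()> produce)
        : produce_(std::move(produce)) {}

    // Return the next token, reading from the buffer when rewound.
    std::string next() {
        if (pos_ == buffer_.size())
            buffer_.push_back(produce_());
        return buffer_[pos_++];
    }

    std::size_t mark() const { return pos_; }  // remember current position
    void rewind(std::size_t m) { pos_ = m; }   // go back to a saved mark

private:
    std::function<std::string()> produce_;
    std::vector<std::string> buffer_;  // tokens already read
    std::size_t pos_ = 0;
};
```

If your grammar only ever needs one token of lookahead, a single `peek()` slot is enough and the full buffer is overkill; the buffer version is what you reach for with backtracking parsers.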

