Designing a Language Lexer
I'm currently in the process of creating a programming language. I've laid out my entire design and am now creating the lexer for it. I have created numerous lexers and lexer generators in the past, but have never come to adopt the "standard", if one exists.
Is there a specific way a lexer should be created to maximise its usability with as many parsers as possible? Because of the way I design mine, they look like the following:

Code:
int main() {
    printf("Hello, World!");
}
Lexer:
[
KEYWORD:INT, IDENTIFIER:"main", LEFT_ROUND_BRACKET, RIGHT_ROUND_BRACKET, LEFT_CURLY_BRACKET,
IDENTIFIER:"printf", LEFT_ROUND_BRACKET, STRING:"Hello, World!", RIGHT_ROUND_BRACKET, SEMICOLON,
RIGHT_CURLY_BRACKET
]
Is this the way lexers should be made? Also, as a side note, what should my next step be after creating a lexer? I don't really want to use something such as ANTLR or Lex+Yacc or Flex+Bison, etc. I'm doing it from scratch.
If you don't want to use a parser generator [Note 1], then it is absolutely up to you how your lexer provides information to your parser.
Even if you do use a parser generator, there are many details which are going to be project-dependent. Sometimes it is convenient for the lexer to call the parser with each token; other times it is easier if the parser calls the lexer; in some cases, you'll want a driver which interacts separately with each component. And clearly, the precise datatype(s) of your tokens will vary from project to project, which can have an impact on how you communicate as well.
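For instance, the difference between "parser calls lexer" (pull) and "lexer calls parser" (push) comes down to which side owns the loop. Here is a minimal C++ sketch; all the names are my own invention rather than any standard:

#include <cctype>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical token type, for illustration only.
struct Token { char kind; std::string text; };

// Pull style: the parser asks the lexer for a token whenever it needs one.
struct PullLexer {
    std::string src;
    std::size_t pos = 0;
    Token next() {
        while (pos < src.size() && std::isspace((unsigned char)src[pos])) ++pos;
        if (pos == src.size()) return {0, ""};   // end of input
        char c = src[pos++];                     // single-char tokens only
        return {c, std::string(1, c)};
    }
};

// Push style: the lexer owns the loop, handing each token to the parser,
// represented here as a callback.
void pushLex(const std::string& src, const std::function<void(const Token&)>& parse) {
    PullLexer lx{src};
    for (Token t = lx.next(); t.kind != 0; t = lx.next()) parse(t);
}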
Personally, I would avoid the use of global variables (as in the original yacc/lex protocol), but that's a general style issue.
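For comparison, the original lex/yacc protocol passes everything through globals, roughly like this (simplified; real lex-generated scanners involve more machinery):

// Classic (non-reentrant) lex/yacc communication: yylex() returns a
// token-kind code, and the token's text arrives via a global.
extern char *yytext;   // matched text, set as a side effect of yylex()
int yylex(void);       // returns the token kind; 0 at end of input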
Most lexers work in streaming mode, rather than tokenizing the entire input and then handing the vector of tokens to some higher power. Tokenizing one token at a time has a number of advantages, particularly if the tokenization is context-dependent, and, let's face it, almost all languages have some impurity somewhere in their syntax. But, again, that's entirely up to you.
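One concrete reason streaming helps with context-dependence: the parser can feed information back to the lexer between tokens. The classic example is the C "lexer hack", where whether an identifier lexes as a type name depends on typedefs the parser has already processed. A rough sketch, with names that are my assumptions:

#include <set>
#include <string>

enum class Kind { TypeName, Identifier };

struct ContextLexer {
    std::set<std::string> typeNames;   // the parser inserts names as it sees typedefs

    // How an identifier is classified depends on parser-supplied context,
    // which only works if tokens are produced one at a time.
    Kind classify(const std::string& ident) const {
        return typeNames.count(ident) ? Kind::TypeName : Kind::Identifier;
    }
};

A tokenize-everything-first design would have to commit to a classification before the parser had processed the typedef that changes it.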
Good luck with your project.
Is there a specific way a lexer should be created to maximise its usability with as many parsers as possible?
In the lexers I've looked at, the canonical API is pretty minimal. It's basically:
Token readNextToken();
The lexer maintains a reference to the source text and internal pointers to where it is currently looking. Then, every time you call that, it scans and returns the next token.
The Token type usually has: a tag saying which kind of token it is, the matched source text (or a decoded value for literals), and the source position, for error reporting.
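Putting that together, a minimal C++ sketch of such a lexer might look like this (everything beyond the readNextToken name is my own assumption):

#include <cctype>
#include <cstddef>
#include <string>

enum class TokenType { Identifier, Number, Punct, EndOfInput };

struct Token {
    TokenType type;      // which kind of token this is
    std::string text;    // the matched source text
    std::size_t offset;  // where it started, for error reporting
};

class Lexer {
    const std::string& src;  // reference to the source text
    std::size_t pos = 0;     // internal pointer to where we are currently looking
public:
    explicit Lexer(const std::string& s) : src(s) {}

    // Scans and returns the next token on every call.
    Token readNextToken() {
        while (pos < src.size() && std::isspace((unsigned char)src[pos])) ++pos;
        std::size_t start = pos;
        if (pos == src.size())
            return {TokenType::EndOfInput, "", start};
        if (std::isalpha((unsigned char)src[pos]) || src[pos] == '_') {
            while (pos < src.size() &&
                   (std::isalnum((unsigned char)src[pos]) || src[pos] == '_')) ++pos;
            return {TokenType::Identifier, src.substr(start, pos - start), start};
        }
        if (std::isdigit((unsigned char)src[pos])) {
            while (pos < src.size() && std::isdigit((unsigned char)src[pos])) ++pos;
            return {TokenType::Number, src.substr(start, pos - start), start};
        }
        ++pos;  // anything else becomes a single-character punctuation token
        return {TokenType::Punct, src.substr(start, 1), start};
    }
};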
Depending on your grammar, you may need to be able to rewind the lexer too.
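A simple way to get that, assuming a Lexer like the sketch above, is a pushback buffer, so the parser can un-read tokens it has speculatively consumed:

#include <vector>

class RewindableLexer {
    Lexer inner;                // the Lexer sketched above
    std::vector<Token> pushed;  // tokens the parser has given back
public:
    explicit RewindableLexer(const std::string& s) : inner(s) {}

    Token readNextToken() {
        if (!pushed.empty()) {
            Token t = pushed.back();   // serve pushed-back tokens first
            pushed.pop_back();
            return t;
        }
        return inner.readNextToken();
    }

    // The parser calls this to rewind by one token.
    void pushBack(const Token& t) { pushed.push_back(t); }
};

The alternative is to save and restore the lexer's source position, which works just as well when tokens are cheap to re-scan.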