
Literal extraction policy for a lexical analyzer

I have built a lexical analyzer for a C-like language. For example, given the input below, it produces the following result.

Input

int i = 0 ; int j = i + 3;

Output

int    KEYWORD
i      IDENTIFIER
=      OPERATOR
0      INTEGER_CONSTANT
;      PUNCTUATION
int    KEYWORD
j      IDENTIFIER
=      OPERATOR
i      IDENTIFIER
+      OPERATOR
3      INTEGER_CONSTANT
;      PUNCTUATION

As you may have noticed, the input in the example above is syntactically correct. However, when I give it something like the input below, it fails.

Input

int i = "1.2.2222.+\<++++

I have written a class whose sole purpose is to break the above string into small parts (I call them literals; I don't know whether that is the correct term) that can be matched with a regex or validated with a DFA.

The problem arises in ambiguous situations such as +, where + can be an addition operator, part of an upcoming integer literal, or part of an increment operator. My teacher's requirement is explained in the next paragraph.

If a + is preceded by a +, it should be processed as an increment operator. In simple words, the program must look for every possibility and choose the best one. That means that if the program has some valid input, then some invalid input, and then some valid input again, it should not stop at the invalid input but keep finding the correct literals. Personally, I am against this. My argument is that if a program string becomes invalid at a certain index, the analyzer should stop processing, because we are not writing an error-checking system after all.

I have tried to code all the possibilities using a nested if-else structure that is complex (for me), with partial success. Can any of you suggest a simpler, more elegant solution? I have also thought of structuring this problem as a state machine, but I am not too sure, because I have never implemented a state machine other than a DFA that just answers yes or no for pattern matching.

As you can see, it is a homework question, but I am not asking for just code.

The usual approach to lexical analysis is to use the "maximal munch" algorithm: the input stream is divided into tokens by repeatedly taking the longest prefix which could be a single token. See this answer for one algorithm.
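As a rough illustration of maximal munch, here is a minimal Python sketch. The token classes, the regex patterns, and the ERROR class are assumptions made for this example, not the asker's actual rules. At each position it tries every pattern, keeps the longest match, and, when nothing matches, emits a one-character ERROR token and carries on, which is one way to keep finding valid tokens after invalid input. In this sketch a sign is never part of a numeric literal; that decision is left to the parser.

```python
import re

# Hypothetical token classes for a small C-like language; the names and
# patterns are assumptions for illustration only.
TOKEN_SPEC = [
    ("KEYWORD", r"(?:int|float|char|if|else|while|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("FLOAT_CONSTANT", r"\d+\.\d+"),
    ("INTEGER_CONSTANT", r"\d+"),
    ("OPERATOR", r"\+\+|--|==|<=|>=|!=|[-+*/=<>]"),
    ("PUNCTUATION", r"[;,(){}]"),
]
COMPILED = [(kind, re.compile(pattern)) for kind, pattern in TOKEN_SPEC]


def tokenize(source):
    """Split source into (text, kind) pairs using the longest-match rule."""
    pos = 0
    tokens = []
    while pos < len(source):
        if source[pos].isspace():
            pos += 1  # whitespace only separates tokens
            continue
        best_kind, best_text = None, ""
        for kind, pattern in COMPILED:
            match = pattern.match(source, pos)
            # Strict ">" means the earlier class wins a tie, so "int" is
            # reported as KEYWORD rather than IDENTIFIER.
            if match and len(match.group()) > len(best_text):
                best_kind, best_text = kind, match.group()
        if best_kind is None:
            # No pattern matches here: record one bad character and go on.
            best_kind, best_text = "ERROR", source[pos]
        tokens.append((best_text, best_kind))
        pos += len(best_text)
    return tokens


if __name__ == "__main__":
    for text, kind in tokenize("int i = 0 ; int j = i + 3;"):
        print(f"{text:<6} {kind}")
```

With these assumed patterns, ++++ comes out as two ++ tokens, because the longest operator that matches at each position is taken before a single + is considered.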

It is occasionally necessary to make exceptions to this rule (in C++, for example, <:: is normally lexed into < , :: ) but on the whole, the maximal munch rule is easy to implement and, more importantly, to read.
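Such an exception is usually handled as a small special case checked before the longest-match step. The Python sketch below assumes a tiny, hypothetical punctuator table and a hypothetical helper named next_punctuator. It follows the C++ rule (C++11 and later) as I understand it: in <:: the < stands alone unless the next character is : or >.

```python
# Hypothetical punctuator table: a small subset chosen for illustration.
PUNCTUATORS = ["<:", ":>", "::", "<", ">", ":"]


def next_punctuator(source, pos):
    """Return the punctuator that starts at pos, or None if there is none."""
    # Exception to maximal munch: in "<::" the "<" is a token by itself
    # unless the character after "<::" is ":" or ">".
    if source.startswith("<::", pos):
        following = source[pos + 3:pos + 4]  # empty string at end of input
        if following not in (":", ">"):
            return "<"
    # Otherwise apply the ordinary longest-match rule.
    candidates = [p for p in PUNCTUATORS if source.startswith(p, pos)]
    return max(candidates, key=len, default=None)


if __name__ == "__main__":
    print(next_punctuator("<::std::string>", 0))  # exception applies
    print(next_punctuator("<:::", 0))             # ordinary longest match
```

Keeping the exception as one explicit check leaves the rest of the lexer a plain longest-match loop.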

