Compiler: limitation of lexical analysis

In classic compiler theory, the first two phases are lexical analysis and parsing. They form a pipeline: lexical analysis recognizes tokens, which become the input to parsing.

But I came across some cases that are hard to recognize correctly during lexical analysis. For example, the following code involving C++ templates:

map<int, vector<int>>

The >> would be recognized as a bitwise right shift by a "regular" lexical analysis, but that is not correct here. My feeling is that it is hard to split the handling of this kind of grammar into two phases: part of the lexing work has to be done in the parsing phase, because correctly interpreting the >> relies on the grammar, not only on simple lexical rules.

I'd like to know the theory and practice around this problem. Also, how do C++ compilers handle this case?

The C++ standard requires that an implementation perform lexical analysis to produce a stream of tokens before the parsing stage. According to the lexical analysis rules, two consecutive > characters (not followed by =) will always be interpreted as one >> token. The grammar provided with the C++ standard is defined in terms of these tokens.
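To make the longest-match rule concrete, here is a minimal sketch of a lexer that always prefers the longest operator. The function name lexMaximalMunch and the tiny operator set are assumptions made for illustration; a real C++ lexer handles far more token kinds.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not taken from any compiler): tokenize with maximal munch.
std::vector<std::string> lexMaximalMunch(const std::string& src) {
    // Operators listed longest first, so the first match is also the longest one.
    static const std::vector<std::string> ops = {">>=", ">>", ">=", ">", "<", ","};
    auto isWord = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        if (std::isspace(static_cast<unsigned char>(src[i]))) { ++i; continue; }
        std::size_t start = i;
        if (isWord(src[i])) {
            while (i < src.size() && isWord(src[i])) ++i;  // identifier or number
        } else {
            std::size_t len = 1;                            // fallback: one character
            for (const auto& op : ops) {
                if (src.compare(i, op.size(), op) == 0) { len = op.size(); break; }
            }
            i += len;
        }
        assert(i > start);  // every iteration consumes at least one character
        tokens.push_back(src.substr(start, i - start));
    }
    return tokens;
}
```

Fed the string map<int, vector<int>>, this sketch ends the token list with a single >> token, which is exactly the situation the question describes.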

The requirement that in certain contexts (such as when expecting a > within a template-id) the implementation should interpret >> as two > is not specified within the grammar. Instead, the rule is specified as a special case:

14.2 Names of template specializations [temp.names]

After name lookup (3.4) finds that a name is a template-name or that an operator-function-id or a literal-operator-id refers to a set of overloaded functions any member of which is a function template if this is followed by a <, the < is always taken as the delimiter of a template-argument-list and never as the less-than operator. When parsing a template-argument-list, the first non-nested > is taken as the ending delimiter rather than a greater-than operator. Similarly, the first non-nested >> is treated as two consecutive but distinct > tokens, the first of which is taken as the end of the template-argument-list and completes the template-id. [ Note: The second > token produced by this replacement rule may terminate an enclosing template-id construct or it may be part of a different construct (e.g. a cast). - end note ]

Note the earlier rule in that passage: in certain contexts, < must be interpreted as the opening < of a template-argument-list. This is another example of a construct that requires context in order to disambiguate the parse.
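The replacement rule for >> can be sketched as a pass over an already-lexed token list. This is only an illustration under a strong assumption: every < token is taken to open a template-argument-list, whereas a real implementation decides that through name lookup. The name splitClosingShift is hypothetical, and only parentheses are treated as nesting.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: split a non-nested ">>" into two ">" tokens.
// ASSUMPTION: every "<" opens a template-argument-list (no name lookup here).
std::vector<std::string> splitClosingShift(const std::vector<std::string>& in) {
    std::vector<std::string> out;
    // One entry per open template-argument-list: count of "(" opened inside it.
    std::vector<int> openParens;
    for (const auto& tok : in) {
        const bool inList = !openParens.empty();
        const bool nonNested = inList && openParens.back() == 0;
        if (tok == ">>" && nonNested) {
            openParens.pop_back();            // first ">" ends the current list
            out.push_back(">");
            // The second ">" ends an enclosing list if one is open and non-nested;
            // otherwise it is left for the parser as an ordinary ">" token.
            if (!openParens.empty() && openParens.back() == 0) openParens.pop_back();
            out.push_back(">");
            continue;
        }
        if (tok == "<") openParens.push_back(0);
        else if (tok == ">" && nonNested) openParens.pop_back();
        else if (tok == "(" && inList) ++openParens.back();
        else if (tok == ")" && inList && openParens.back() > 0) --openParens.back();
        out.push_back(tok);
    }
    assert(out.size() >= in.size());  // splitting can only add tokens
    return out;
}
```

With this sketch, the >> in map<int, vector<int>> becomes two > tokens, while the >> in X<(a >> b)> is left alone because it sits inside parentheses.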

The C++ grammar contains many such ambiguities that cannot be resolved during parsing without information about the context. The best known of these is the Most Vexing Parse, in which an identifier may be interpreted as a type-name depending on context.

Keeping track of this context in C++ requires an implementation to perform some semantic analysis in parallel with the parsing stage. This is commonly implemented in the form of semantic actions that are invoked when a particular grammatical construct is recognised in a given context. These semantic actions then build a data structure that represents the context and permits efficient queries. This is often referred to as a symbol table, but the structure required for C++ is pretty much the entire AST.

This kind of context-sensitive semantic action can also be used to resolve ambiguities. For example, on recognising an identifier in the context of a namespace-body, a semantic action will check whether the name was previously defined as a template. The result of this check is then fed back to the parser. This can be done by marking the identifier token with the result, or by replacing it with a special token that will match a different grammar rule.

The same technique can be used to mark a < as the beginning of a template-argument-list, or a > as the end. The rule for context-sensitive replacement of >> with two > poses essentially the same problem and can be resolved using the same method.
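A small sketch of this token-marking idea follows. The set of known template names stands in for the symbol table, and every name here (Kind, Token, markTokens) is hypothetical. It assumes the template names are already known, and it does not distinguish keywords from identifiers.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <vector>

enum class Kind { Identifier, TemplateName, TemplateOpen, Less, Other };

struct Token {
    Kind kind;
    std::string text;
};

// Hypothetical semantic action: consult the "symbol table" (a plain set here)
// and mark each token so that the parser can match a different grammar rule.
std::vector<Token> markTokens(const std::vector<std::string>& raw,
                              const std::set<std::string>& knownTemplates) {
    std::vector<Token> out;
    for (const auto& text : raw) {
        Kind kind = Kind::Other;
        const unsigned char first = static_cast<unsigned char>(text.empty() ? ' ' : text[0]);
        if (text == "<") {
            // "<" right after a template-name opens a template-argument-list.
            const bool afterTemplate = !out.empty() && out.back().kind == Kind::TemplateName;
            kind = afterTemplate ? Kind::TemplateOpen : Kind::Less;
        } else if (std::isalpha(first) || first == '_') {
            kind = knownTemplates.count(text) ? Kind::TemplateName : Kind::Identifier;
        }
        out.push_back({kind, text});
    }
    assert(out.size() == raw.size());  // exactly one marked token per input token
    return out;
}
```

With vector registered as a template, the < in vector<int> is marked TemplateOpen, while the < in a < b stays an ordinary Less token.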

You are right, the clean theoretical distinction between lexer and parser is not always possible. I remember a project I worked on as a student. We were to implement a C compiler, and the grammar we used as a basis would treat typedef'd names as types in some cases and as identifiers in others. So the lexer had to switch between these two modes. The way I implemented this back then was with special empty rules, which reconfigured the lexer depending on context. To accomplish this, it was vital to know that the parser would always use exactly one token of look-ahead. So any change to lexer behaviour would have to occur at least one lexical token before the affected location. In the end, this worked quite well.
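The feedback from parser to lexer described above might look roughly like this. It is a sketch under assumptions, not the student project's actual code: FeedbackLexer, declareType and CTok are invented names, and the lexer is pulled one token at a time so that a declaration registered by the parser affects only tokens scanned afterwards.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <utility>

enum class CTok { TypeName, Identifier, Punct, End };

// Hypothetical pull-based lexer whose classification depends on parser feedback.
class FeedbackLexer {
public:
    explicit FeedbackLexer(std::string src) : src_(std::move(src)) {}

    // Called from a parser action once a typedef declarator has been recognised.
    void declareType(const std::string& name) { typeNames_.insert(name); }

    // Scan one token on demand, using the type table as it is at this moment.
    std::pair<CTok, std::string> next() {
        while (pos_ < src_.size() && std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
        if (pos_ >= src_.size()) return {CTok::End, ""};
        const std::size_t start = pos_;
        if (isWordChar(src_[pos_])) {
            while (pos_ < src_.size() && isWordChar(src_[pos_])) ++pos_;
            std::string word = src_.substr(start, pos_ - start);
            const CTok kind = typeNames_.count(word) ? CTok::TypeName : CTok::Identifier;
            return {kind, word};
        }
        ++pos_;
        assert(pos_ == start + 1);  // punctuation is a single character in this sketch
        return {CTok::Punct, src_.substr(start, 1)};
    }

private:
    static bool isWordChar(char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    }
    std::string src_;
    std::size_t pos_ = 0;
    std::set<std::string> typeNames_{"int"};  // built-in type known from the start
};
```

Because the parser has already pulled its look-ahead token by the time an action runs, declareType has to be called before the lexer scans the first token that should be affected, which is the timing constraint mentioned above.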

In the C++ case of >> that you mention, I don't know what compilers actually do. willj quoted how the specification phrases this, but implementations are allowed to do things differently internally, as long as the visible result is the same. So here is how I'd try to tackle this: upon reading a >, the lexer would emit the token GREATER, but also switch to a state where each subsequent > without a space in between would be lexed as GREATER_REPEATED. Any other symbol would switch the state back to normal. Instead of state switches, you could also do this by lexing the regular expression >+ and emitting multiple tokens from this rule. In the parser, you could then use rules like the following:

rightAngleBracket: GREATER | GREATER_REPEATED;
rightShift: GREATER GREATER_REPEATED;

With a bit of luck, you could make template argument rules use rightAngleBracket, while expressions would use rightShift. Depending on how much look-ahead your parser has, it might be necessary to introduce additional non-terminals to hold longer sequences of ambiguous content, until you encounter some context that allows you to finally decide between these cases.
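The lexer side of this scheme could be sketched as follows. It works on single characters for brevity: anything that is neither > nor whitespace becomes a generic Other token, which is an assumption of this sketch, and the names Angle and lexAngles are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

enum class Angle { Greater, GreaterRepeated, Other };

// Hypothetical lexer state machine: the first '>' of a run is Greater, and each
// directly adjacent '>' after it is GreaterRepeated. Any other character,
// including whitespace, resets the state.
std::vector<Angle> lexAngles(const std::string& src) {
    std::vector<Angle> out;
    bool prevWasGreater = false;
    for (char c : src) {
        if (c == '>') {
            out.push_back(prevWasGreater ? Angle::GreaterRepeated : Angle::Greater);
            prevWasGreater = true;
        } else {
            prevWasGreater = false;
            if (!std::isspace(static_cast<unsigned char>(c))) out.push_back(Angle::Other);
        }
    }
    assert(out.size() <= src.size());  // at most one token per input character
    return out;
}
```

Under this sketch, a>>b yields Greater followed by GreaterRepeated, so the rightShift rule can match, whereas a> >b yields two Greater tokens and can only match rightAngleBracket twice.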
