
Is Rust's lexical grammar regular, context-free or context-sensitive?

The lexical grammar of most programming languages is fairly non-expressive so that it can be lexed quickly. I'm not sure what category Rust's lexical grammar belongs to. Most of it seems regular, probably with the exception of raw string literals:

let s = r##"Hi lovely "\" and "#", welcome to Rust"##;
println!("{}", s);

Which prints:

Hi lovely "\" and "#", welcome to Rust

As we can add arbitrarily many #, it seems like the grammar can't be regular, right? But is it at least context-free? Or is there something non-context-free about Rust's lexical grammar?


Related: Is Rust's syntactical grammar context-free or context-sensitive?

The raw string literal syntax is not context-free.

If you think of it as a string surrounded by r#ᵏ"…"#ᵏ (using the superscript k as a count operator), then you might expect it to be context-free:

raw_string_literal
   : 'r' delimited_quoted_string
delimited_quoted_string
   : quoted_string
   | '#' delimited_quoted_string '#'

But that is not actually the correct syntax, because the quoted_string is not allowed to contain "#ᵏ, although it can contain "#ʲ for any j<k.
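
For concreteness, a small snippet showing the rule in action (the variable names are just illustrative):

let one = r#"a quote " is fine here"#;          // k=1: content may contain ", but not "#
let two = r##"both " and "# are fine here"##;   // k=2: content may contain "#, but not "##
println!("{}\n{}", one, two);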

Excluding the terminating sequence without excluding any other similar sequence of a different length cannot be accomplished with a context-free grammar, because it involves three (or more) uses of the k-repetition in a single production, and stack automata can only handle two. (The proof that the grammar is not context-free is surprisingly complicated, so I'm not going to attempt it here for lack of MathJax. The best proof I could come up with uses Ogden's lemma and the uncommonly cited (but highly useful) property that context-free grammars are closed under the application of a finite-state transducer.)
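
For readers who want the shape of such an argument, here is a compressed sketch of one standard route (it uses closure under intersection with a regular set rather than the transducer property mentioned above, and it is an outline only, not the author's proof):

Let L be the set of well-formed raw string literals, and let R be the
regular set r #* " " #* " #*. Then

    L ∩ R = { r #ᵏ " " #ʲ " #ᵏ : 0 ≤ j < k }

because the inner "#ʲ must not terminate the literal while the final "#ᵏ
must. Context-free languages are closed under intersection with regular
sets, so if L were context-free, L ∩ R would be too. Applying Ogden's
lemma to r #ᵖ " " #ᵖ⁻¹ " #ᵖ (p the Ogden constant, first block of #s
marked), every admissible decomposition either unbalances the two #ᵏ
blocks, breaks the overall format, or, when it removes hashes from both
end blocks, pumps down to a string with j ≥ k. Hence L ∩ R, and
therefore L, is not context-free.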

C++ raw string literals are also context-sensitive [or would be if the delimiter length were not limited; see Note 1], and pretty well all whitespace-sensitive languages (like Python and Haskell) are context-sensitive. None of these lexical analysis tasks is particularly complicated, so the context-sensitivity is not a huge problem, although most standard scanner generators don't provide as much assistance as one might like. But there it is.

Rust's lexical grammar offers a couple of other complications for a scanner generator. One issue is the double meaning of ', which is used both to create character literals and to mark lifetime variables and loop labels. Apparently it is possible to determine which of these applies by considering the previously recognized token. That could be solved with a lexical scanner which is capable of generating two consecutive tokens from a single pattern, or it could be accomplished with a scannerless parser; the latter solution would be context-free but not regular. (C++'s use of ' as part of numeric literals does not cause the same problem; the C++ tokens can be recognized with regular expressions, because the ' cannot be used as the first character of a numeric literal.)
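
All three uses of ' can appear in ordinary Rust; a small illustrative snippet (the names are arbitrary):

let c = 'x';                 // ' opens and closes a character literal
let s: &'static str = "hi";  // ' marks a lifetime: no closing quote follows
'outer: loop {               // ' also marks a loop label
    break 'outer;
}
println!("{} {}", c, s);

Which reading applies at each ' depends on the surrounding context, as discussed above.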

Another slightly context-dependent lexical issue is that the range operator, .., takes precedence over floating point values, so that 2..3 must be lexically analysed as three tokens: 2 .. 3, rather than the two floating point numbers 2. and .3, which is how it would be analysed in most languages that use the maximal munch rule. Again, this might or might not be considered a deviation from regular-expression tokenisation, since it depends on trailing context. But since the lookahead is at most one character, it could certainly be implemented with a DFA.
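
A quick way to see the rule in action (a sketch; the comments state the tokenisation):

let r = 2..3;                   // lexed as three tokens: `2`, `..`, `3`
let v: Vec<i32> = r.collect();  // a half-open integer range containing just [2]
let f = (2., 0.3);              // the two-float reading must be written out explicitly
println!("{:?} {:?}", v, f);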

Postscript

On reflection, I am not sure that it is meaningful to ask about a "lexical grammar". Or, at least, it is ambiguous: the "lexical grammar" might refer to the combined grammar for all of the language's tokens, or it might refer to the act of separating a sentence into tokens. The latter is really a transducer, not a parser, and raises the question of whether the language can be tokenised with a finite-state transducer. (The answer, again, is no, because raw strings cannot be recognized by a FSA, or even a PDA.)

Recognizing individual tokens and tokenising an input stream are not necessarily equivalent. It is possible to imagine a language in which the individual tokens are all recognized by regular expressions but an input stream cannot be handled with a finite-state transducer. That will happen if there are two regular expressions T and U such that some string matching T is the longest token which is a strict prefix of an infinite set of strings in U. As a simple (and meaningless) example, take a language with tokens:

a
a*b

Both of these tokens are clearly regular, but the input stream cannot be tokenized with a finite-state transducer, because it must examine any sequence of a's (of any length) before deciding whether to fall back to the first a or to accept the token consisting of all the a's and the following b (if present).
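
To make the unbounded lookahead concrete, here is a toy fallback tokenizer for that two-token language (a sketch written for this discussion, not part of the original answer):

fn tokenize(input: &str) -> Option<Vec<&str>> {
    let bytes = input.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        while i < bytes.len() && bytes[i] == b'a' {
            i += 1; // consume the whole run of a's before deciding
        }
        if i < bytes.len() && bytes[i] == b'b' {
            i += 1;
            tokens.push(&input[start..i]); // the run plus the b form one a*b token
        } else if i > start {
            for j in start..i {
                tokens.push(&input[j..j + 1]); // fall back: each a is its own token
            }
        } else {
            return None; // neither token matches here
        }
    }
    Some(tokens)
}

assert_eq!(tokenize("aaab"), Some(vec!["aaab"]));
assert_eq!(tokenize("aaa"), Some(vec!["a", "a", "a"]));

The decision made after the inner loop depends on a character arbitrarily far from where the token began, which is exactly the deferral a finite-state transducer cannot perform.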

Few languages exhibit this pathology (and, as far as I know, Rust is not one of them), but it is technically present in some languages in which keywords are multiword phrases.

Notes

  1. Actually, C++ raw string literals are, in a technical sense, regular (and therefore context-free) because their delimiters are limited to strings of maximum length 16 drawn from an alphabet of 88 characters. That means that it is (theoretically) possible to create a regular expression consisting of 13,082,362,351,752,551,144,309,757,252,761 patterns, each matching a different possible raw string delimiter.
