How grammar production rules relate to parsing in ECMAScript

Original post: 2018-08-15 15:26:36 | Tags: javascript, parsing, ecmascript-6, grammar

As noted in the Wikipedia article on Parsing, the process has three stages:

1. Lexical analysis (tokenization): convert Unicode code points to tokens
2. Syntactic analysis: verify that the token stream forms a valid Script / Module, and create a Parse Tree
3. Semantic analysis: additional verification of tokens (happens after the Parse Tree is created?)

Other than the small confusion in stage (3) above, I wanted to verify that my understanding of the process is correct for ECMAScript.

Thus, is the flow below correct?

Lexical Analysis Phase (ECMAScript Clause 11)

Input: Stream of Unicode code points <-- Terminal symbols in the lexical grammar
Output: Valid tokens <-- Nonterminal symbols in the lexical grammar
Application of grammar
1. Each Unicode code point (character) is analysed, one at a time
2. The longest possible sequence of terminal symbols is replaced with a nonterminal symbol, by applying a suitable production rule
3. Then, the longest possible sequence of nonterminal symbols is replaced, again by applying production rules
4. In the same way, production rules are applied again and again, all the way until "goal symbol(s)" are produced
Goal symbols are input elements (aka tokens) for the syntactic analysis phase (next phase)
Multiple "goal symbols" exist for ECMAScript's lexical grammar (the spec states which to pick)

Syntactic Analysis Phase (ECMAScript Clauses 12-15)

Input: Stream of tokens <-- Terminal symbols in the syntactic grammar
Output: Parse Tree, with Script | Module as root Parse Node <-- Nonterminal symbol in the syntactic grammar
Application of grammar
1. Start with the stream of input elements, aka tokens
2. These tokens are terminal symbols in the syntactic grammar
3. Apply production rules by matching the maximum stream of symbols with the RHS of a suitable production rule, then replacing that stream with the LHS nonterminal symbol of that rule
4. This continues until only the "goal symbol" is left
ECMAScript: the program is valid if we can replace all terminal symbols (tokens) to end with the single "goal symbol" (Script | Module)

1 Answer

Syntactic parsing does not obey the "maximal munch" rule (select the longest matching prefix). In fact, as far as I know, ECMA-262 does not specify a parsing algorithm, but it does provide an unambiguous context-free grammar which can be parsed, for example, with a bottom-up (LR(k)) parser, aside from some issues dealing with automatic semicolon insertion and some restrictions on productions which span a newline (which is not a syntactic token).

However, as mentioned in §5.1.4, the grammar actually recognises a superset of the language; additional restrictions are provided in the form of supplementary grammars.

One clarification: The complexities related to having multiple context-dependent lexical goal symbols make it difficult to first divide the input into lexemes and only then combine the lexemes into a parse tree. It is impossible to know the correct lexical goal symbol at each point without at least a partial parse, so it is convenient to interleave the syntactic and lexical parses. Practical parsing algorithms operate from left to right, processing lexemes basically in input order, so it is possible to do lexical analysis on demand, only finding a lexeme when the parser needs more input to continue.
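As a rough illustration of lexing on demand, here is a minimal sketch in which the consumer tells the lexer which goal to use before each lexeme. `InputElementDiv` and `InputElementRegExp` are goal symbol names from ECMA-262; everything else (`nextToken`, `tokensOnDemand`, the tiny token set, and the "previous token was an operand" test) is an assumption made up for this example. A real parser would choose the goal from its actual parse state, since a rule based only on the previous token gets cases like `if (x) /re/.test(y)` wrong.

```javascript
// Sticky ("y") patterns for a tiny, illustrative token set.
// This is not the full ECMAScript lexical grammar.
const COMMON = [
  ["Identifier", /[A-Za-z_$][A-Za-z0-9_$]*/y],
  ["Number", /[0-9]+/y],
  ["Punctuator", /[()=+;]/y],
];
const DIV = ["Punctuator", /\/=?/y];
// Simplified regex literal: no character classes containing "/".
const REGEXP = ["RegExpLiteral", /\/(?:[^\/\\\n]|\\.)+\/[a-z]*/y];

// Find exactly one lexeme starting at `pos`, using the requested goal.
function nextToken(src, pos, goal) {
  const ws = /[ \t\n]*/y;
  ws.lastIndex = pos;
  pos += ws.exec(src)[0].length; // skip whitespace (may be empty)
  if (pos >= src.length) return { type: "EOF", text: "", end: pos };
  // The goal decides how a leading "/" is interpreted.
  const slash = goal === "InputElementRegExp" ? REGEXP : DIV;
  for (const [type, re] of [...COMMON, slash]) {
    re.lastIndex = pos;
    const m = re.exec(src);
    if (m) return { type, text: m[0], end: pos + m[0].length };
  }
  throw new SyntaxError(`Unexpected character at ${pos}`);
}

// Crude stand-in for a parser: it requests lexemes one at a time and
// picks the goal from what it has seen so far.
function tokensOnDemand(src) {
  const out = [];
  let pos = 0;
  let prev = null;
  for (;;) {
    const afterOperand =
      prev !== null &&
      (prev.type === "Identifier" || prev.type === "Number" || prev.text === ")");
    const goal = afterOperand ? "InputElementDiv" : "InputElementRegExp";
    const tok = nextToken(src, pos, goal);
    if (tok.type === "EOF") return out;
    out.push(`${tok.type}:${tok.text}`);
    pos = tok.end;
    prev = tok;
  }
}

console.log(tokensOnDemand("a = b / c / d"));
console.log(tokensOnDemand("a = /c/g"));
```

In the first input each `/` follows an operand and is lexed as a division punctuator; in the second the `/` follows `=`, so the same character starts a regular expression literal. The lexer alone could not have told the two apart.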

But aside from that, the overall structure you outline is correct. In the lexical parse, the longest possible prefix of terminals (characters) is aggregated into a non-terminal to create a lexeme (according to slightly complicated rules about which lexical goal is required); in the syntactic parse, terminals (lexemes) are aggregated into non-terminals to produce a single parse tree corresponding to one of two syntactic goal symbols.

As is often the case with real-world languages, the reality is not quite as clean as that. Aside from the need for the parser to indicate which lexical goal is required, there are also the newline rules and automatic semicolon insertion, both of which cross the boundary between lexical and syntactic parsing.

Note:

The use of the words "terminal" and "non-terminal" can be a bit confusing, but I (and the ECMA standard) use them with the standard meaning in a context-free grammar.

A context-free grammar consists of productions, each of which has the form:

N ⇒ S …

where N is a non-terminal symbol and S is a possibly-empty sequence of either terminal or non-terminal symbols. Terminal symbols are atoms in the representation of the string to be recognized.
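To make the definition concrete, here is a small sketch that writes a grammar down as data and derives the two kinds of symbol from it. The grammar (`Sum`, `Term`, `Sign`) is a toy invented for this example and is not taken from ECMA-262.

```javascript
// Each production has the form N ⇒ S …: one non-terminal on the left,
// a possibly-empty sequence of symbols on the right.
const productions = [
  { lhs: "Sum", rhs: ["Sum", "+", "Term"] },
  { lhs: "Sum", rhs: ["Term"] },
  { lhs: "Term", rhs: ["number"] },
  { lhs: "Term", rhs: ["(", "Sum", ")"] },
  { lhs: "Sign", rhs: [] }, // an empty right-hand side is allowed
];

// A symbol is a non-terminal if some production defines it.
const nonTerminals = new Set(productions.map((p) => p.lhs));

// Every other symbol mentioned on a right-hand side is a terminal:
// an atom of the input that the grammar does not break down further.
const terminals = new Set(
  productions.flatMap((p) => p.rhs).filter((s) => !nonTerminals.has(s))
);

console.log([...nonTerminals]);
console.log([...terminals]);
```

Whether a symbol counts as a terminal depends only on the grammar in question, which is why the same lexeme can be a non-terminal of one grammar and a terminal of another.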

The standard parsing model divides the parse into two levels: lexical and syntactic. The original input is a sequence of characters; lexical analysis turns this into a sequence of lexemes, which are the input to the syntactic parse.

A standard context-free grammar has a single goal symbol, which is one of the non-terminals defined by the grammar. The parse succeeds if the entire input can be reduced to this non-terminal.

A lexical scan can be viewed as a context-free grammar with an ordered list of goal symbols. It tries each goal symbol in turn on successively longer prefixes of the input, and accepts the first goal symbol which matches the longest prefix. (In practice, this is all done in parallel; I'm talking conceptually here.) When ECMA-262 talks about different lexical goals, it really means different lists of possible goal non-terminals.
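The conceptual procedure can be sketched as follows: try every goal at the current position, keep the longest match, and let the earlier goal win a tie. The goal list, its names, and the functions `scan` and `tokenize` are assumptions for illustration only; in particular, ECMA-262 does not treat keywords as a separate lexical goal in this way.

```javascript
// Ordered list of hypothetical goals. Sticky ("y") regexes match only
// at lastIndex. Alternatives are written longest-first because regex
// alternation takes the first alternative that matches, not the longest.
const goals = [
  { name: "Keyword", pattern: /instanceof|if|in/y },
  { name: "Identifier", pattern: /[A-Za-z_$][A-Za-z0-9_$]*/y },
  { name: "Number", pattern: /[0-9]+/y },
  { name: "Punctuator", pattern: />>>=|>>>|>>=|>>|>=|>|=/y },
  { name: "WhiteSpace", pattern: /[ \t]+/y },
];

// Return the first goal that matches the longest prefix at `pos`.
function scan(input, pos) {
  let best = null;
  for (const goal of goals) {
    goal.pattern.lastIndex = pos;
    const m = goal.pattern.exec(input);
    // Only a strictly longer match replaces the current best,
    // so on a tie the earlier goal in the list is kept.
    if (m && (best === null || m[0].length > best.text.length)) {
      best = { type: goal.name, text: m[0] };
    }
  }
  return best;
}

function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    const tok = scan(input, pos);
    if (tok === null) throw new SyntaxError(`Unexpected character at ${pos}`);
    pos += tok.text.length;
    if (tok.type !== "WhiteSpace") tokens.push(tok);
  }
  return tokens;
}

console.log(tokenize("a >>>= inside instanceof b"));
```

Here `inside` becomes an Identifier because that goal matches six characters while Keyword matches only `in`, whereas `instanceof` matches both goals at the same length and so goes to Keyword, the earlier entry. Likewise `>>>=` is one lexeme rather than `>` followed by more punctuators.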

It's also useful to augment symbols with semantic attributes; these attributes do not influence the parse, but they are useful once the parse is done. In particular, the parse tree is built by attaching a tree node as an attribute to each non-terminal created from a production during the parse. The final result of the parse is therefore not the non-terminal symbol as such (that's known before the parse starts) but rather the semantic attributes attached to that particular instance of the non-terminal. The result of the lexical scan at each point, by contrast, is a non-terminal symbol and its associated semantic attributes; typically, the semantic attribute will be the associated input sequence, or some function of those characters.
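A minimal sketch of this idea, assuming a toy grammar `Sum ⇒ Sum + number | number` that is not from ECMA-262: each stack entry pairs a symbol with a tree node as its semantic attribute, and every reduction builds a new node from the nodes it replaces. The reductions are hand-coded for this one grammar; a real bottom-up parser would be driven by generated tables. The function name `parseSum` and the node shapes are hypothetical.

```javascript
// Input lexemes are terminals of this grammar; each carries its text
// as a semantic attribute, e.g. { type: "number", text: "1" }.
function parseSum(tokens) {
  const stack = []; // entries: { symbol, node }

  // Try to apply one production to the top of the stack.
  const reduce = () => {
    const n = stack.length;
    // Sum ⇒ Sum + number
    if (
      n >= 3 &&
      stack[n - 3].symbol === "Sum" &&
      stack[n - 2].symbol === "+" &&
      stack[n - 1].symbol === "number"
    ) {
      const [left, op, right] = stack.splice(n - 3, 3);
      const node = { type: "Sum", children: [left.node, op.node, right.node] };
      stack.push({ symbol: "Sum", node });
      return true;
    }
    // Sum ⇒ number (only for the first operand)
    if (n === 1 && stack[0].symbol === "number") {
      const [only] = stack.splice(0, 1);
      stack.push({ symbol: "Sum", node: { type: "Sum", children: [only.node] } });
      return true;
    }
    return false;
  };

  for (const tok of tokens) {
    // Shift: the terminal's node is just the lexeme itself.
    stack.push({ symbol: tok.type, node: { type: tok.type, text: tok.text } });
    while (reduce()) {}
  }

  // Success means the whole input was reduced to the goal symbol.
  if (stack.length !== 1 || stack[0].symbol !== "Sum") {
    throw new SyntaxError("input cannot be reduced to Sum");
  }
  return stack[0].node; // the attribute, not the symbol, is the result
}

const tree = parseSum([
  { type: "number", text: "1" },
  { type: "+", text: "+" },
  { type: "number", text: "2" },
]);
console.log(JSON.stringify(tree, null, 2));
```

The symbol left on the stack is always `Sum` for any valid input, so it carries no information; what distinguishes one successful parse from another is the tree stored alongside it.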

In any event, the two-level parse involves feeding the output non-terminals of the lexical level as terminals for the syntactic level.