How grammar production rules relate to parsing in ECMAScript
As noted in the Wikipedia article on Parsing, the process has three stages:
Other than the small confusion in stage (3) above, I wanted to verify that my understanding of the process is correct for ECMAScript.
Thus, is the below flow correct?
The syntactic parsing does not obey the "maximal munch" rule (select the longest matching prefix). In fact, as far as I know, ECMA-262 does not specify a parsing algorithm. It does provide an unambiguous context-free grammar which can be parsed, for example, with a bottom-up (LR(k)) parser, aside from some issues dealing with automatic semicolon insertion and some restrictions on productions which span a newline (which is not a syntactic token).
However, as mentioned in §5.1.4, the grammar actually recognises a superset of the language; additional restrictions are provided in the form of supplementary grammars.
One clarification: the complexities related to having multiple context-dependent lexical goal symbols make it difficult to first divide the input into lexemes and only then combine the lexemes into a parse tree. It is impossible to know the correct lexical goal symbol at each point without at least a partial parse, so it is convenient to interleave the syntactic and lexical parses. Practical parsing algorithms operate from left to right, processing lexemes basically in input order, so it is possible to do lexical analysis on demand, only finding a lexeme when the parser needs more input to continue.
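To make the on-demand idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the token patterns, the `OnDemandLexer` class, the goal names `"regexp"` and `"div"`, and the rule the toy "parser" uses to pick a goal are drastically simplified stand-ins, not what ECMA-262 specifies. It only shows the shape of the interaction: the parser passes the lexical goal each time it asks for one more lexeme.

```python
import re

# Hypothetical, heavily simplified token patterns; the real ECMAScript
# lexical grammar is far larger.
REGEXP = re.compile(r"/(?:[^/\\\n]|\\.)+/[a-z]*")
COMMON = [
    ("Number", re.compile(r"\d+")),
    ("Identifier", re.compile(r"[A-Za-z_$][\w$]*")),
    ("Punctuator", re.compile(r"[-+*/()=]")),
]


class OnDemandLexer:
    def __init__(self, source):
        self.source = source
        self.pos = 0

    def next_token(self, goal):
        """Scan one lexeme; `goal` is chosen by the parser, not the lexer."""
        while self.pos < len(self.source) and self.source[self.pos] == " ":
            self.pos += 1
        if self.pos == len(self.source):
            return ("EOF", "")
        if goal == "regexp":
            m = REGEXP.match(self.source, self.pos)
            if m:
                self.pos = m.end()
                return ("RegExp", m.group())
        for kind, pattern in COMMON:
            m = pattern.match(self.source, self.pos)
            if m:
                self.pos = m.end()
                return (kind, m.group())
        raise SyntaxError(f"unexpected character at offset {self.pos}")


def tokenize_expression(source):
    """Toy 'parser': it only tracks whether an operand is expected next."""
    lexer = OnDemandLexer(source)
    expecting_operand = True
    tokens = []
    while True:
        goal = "regexp" if expecting_operand else "div"
        kind, text = lexer.next_token(goal)
        if kind == "EOF":
            return tokens
        tokens.append((kind, text))
        # After an operand or ")", a following "/" must be division.
        is_operand = kind in ("Number", "Identifier", "RegExp")
        expecting_operand = not (is_operand or text == ")")
```

With this sketch, the `/` characters in `a / b / c` come back as punctuators, while `/b/g` in `x = /b/g` comes back as a single regular-expression lexeme, because the caller asked for a different goal at that point.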
But aside from that, the overall structure you outline is correct. In the lexical parse, the longest possible prefix of terminals (characters) is aggregated into a non-terminal to create a lexeme (according to slightly complicated rules about which lexical goal is required); in the syntactic parse, terminals (lexemes) are aggregated into non-terminals to produce a single parse tree corresponding to one of two syntactic goal symbols.
As is often the case with real-world languages, the reality is not quite as clean as that. Aside from the need for the parser to indicate which lexical goal is required, there are also the newline rules and automatic semicolon insertion, both of which cross the boundary between lexical and syntactic parsing.
The use of the words "terminal" and "non-terminal" can be a bit confusing, but I (and the ECMA standard) use them with the standard meaning in a context-free grammar.
A context-free grammar consists of productions, each of which has the form:
N ⇒ S …
where N is a non-terminal symbol and S is a possibly-empty sequence of either terminal or non-terminal symbols. Terminal symbols are atoms in the representation of the string to be recognized.
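As a rough illustration of that definition, a grammar can be written down as plain data. The `Production` class and the toy grammar below are hypothetical and not taken from ECMA-262. The point is only that terminals are exactly the symbols that never appear on a left-hand side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Production:
    lhs: str    # the non-terminal N
    rhs: tuple  # the sequence S of symbols; may be empty


# Hypothetical toy grammar, for illustration only.
GRAMMAR = [
    Production("Sum", ("Sum", "+", "Term")),
    Production("Sum", ("Term",)),
    Production("Term", ("number",)),
    Production("Opt", ()),  # an empty right-hand side is allowed
]


def non_terminals(grammar):
    """Every symbol that appears on some left-hand side."""
    return {p.lhs for p in grammar}


def terminals(grammar):
    """Every right-hand-side symbol that is never defined by a production."""
    defined = non_terminals(grammar)
    return {s for p in grammar for s in p.rhs if s not in defined}
```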
The standard parsing model divides the parse into two levels: lexical and syntactic. The original input is a sequence of characters; lexical analysis turns this into a sequence of lexemes, which are the input to the syntactic parse.
A standard context-free grammar has a single goal symbol, which is one of the non-terminals defined by the grammar. The parse succeeds if the entire input can be reduced to this non-terminal.
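"Reduced to the goal symbol" can be shown with a deliberately naive shift-reduce recognizer. The toy grammar and the greedy reduction strategy are my own assumptions for the example. Greedy reduction happens to work for this particular grammar but is not a general parsing algorithm (a real LR parser uses tables and lookahead to decide when to reduce).

```python
# Hypothetical toy grammar: Sum ⇒ Sum + Term | Term ; Term ⇒ number
PRODUCTIONS = [
    ("Sum", ("Sum", "+", "Term")),
    ("Sum", ("Term",)),
    ("Term", ("number",)),
]
GOAL = "Sum"


def reduce_once(stack):
    """Replace the longest right-hand side found on top of the stack."""
    for lhs, rhs in sorted(PRODUCTIONS, key=lambda p: -len(p[1])):
        if len(rhs) <= len(stack) and tuple(stack[-len(rhs):]) == rhs:
            del stack[-len(rhs):]
            stack.append(lhs)
            return True
    return False


def recognizes(lexemes):
    """True if the whole input reduces to the single goal symbol."""
    stack = []
    for lexeme in lexemes:
        stack.append(lexeme)       # shift the next terminal
        while reduce_once(stack):  # then reduce as far as possible
            pass
    return stack == [GOAL]
```

For instance, `["number", "+", "number"]` ends with only `Sum` on the stack and is accepted, while `["number", "+"]` leaves `Sum +` behind and is rejected.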
A lexical scan can be viewed as a context-free grammar with an ordered list of goal symbols. It tries each goal symbol in turn on successively longer prefixes of the input, and accepts the first goal symbol which matches the longest prefix. (In practice, this is all done in parallel; I'm talking conceptually here.) When ECMA-262 talks about different lexical goals, it really means different lists of possible goal non-terminals.
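Conceptually, that scan looks something like the following sketch. The goal names and patterns in `GOALS` are invented for the example and are much cruder than the real lexical grammar. Regular expressions stand in for the lexical productions, and the alternatives inside each pattern are listed longest-first because Python's `re` picks the first alternative that matches, not the longest.

```python
import re

# Hypothetical ordered list of lexical goals; earlier entries win ties.
GOALS = [
    ("Keyword", re.compile(r"instanceof|if|in")),
    ("Identifier", re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")),
    ("Punctuator", re.compile(r">>>=|>>>|>>=|>>|>=|>|=")),
]


def scan(source, pos, goals=GOALS):
    """Return (goal, lexeme) for the longest prefix starting at `pos`.

    On a tie in length, the goal listed first is kept.
    """
    best = None
    for name, pattern in goals:
        m = pattern.match(source, pos)
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    if best is None:
        raise SyntaxError(f"no lexeme at offset {pos}")
    return best[0], source[pos:best[1]]
```

So `instance` scans as one identifier even though it starts with the keyword `in`, and `>>>=` scans as one punctuator instead of four. Passing a different `goals` list corresponds to asking for a different lexical goal.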
It's also useful to augment symbols with semantic attributes; these attributes do not influence the parse, but they are useful once the parse is done. In particular, the parse tree is built by attaching a tree node as an attribute to each non-terminal created from a production during the parse. The final result of the parse is therefore not the non-terminal symbol as such (that's known before the parse starts) but rather the semantic attributes attached to that particular instance of the non-terminal. The result of the lexical scan at each point, by contrast, is a non-terminal symbol and its associated semantic attributes; typically, the semantic attribute will be the associated input sequence, or some function of those characters.
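Here is a small sketch of attributes in action, again with a made-up toy grammar (`Sum ⇒ Sum + Term | Term`, `Term ⇒ number`) and a hypothetical `Node` class. Each lexeme arrives as a (symbol, text) pair, where the text is its semantic attribute, and each production applied by the parser creates a tree node. The parser is a hand-written loop, not a generated LR parser, purely to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str                                   # grammar symbol
    children: list = field(default_factory=list)  # sub-trees, if any
    text: str = ""                                # lexeme text for leaves


def parse_sum(lexemes):
    """Build a tree for Sum ⇒ Sum + Term | Term, Term ⇒ number."""
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(lexemes) or lexemes[pos][0] != "number":
            raise SyntaxError(f"expected a number at lexeme {pos}")
        leaf = Node("number", text=lexemes[pos][1])
        pos += 1
        return Node("Term", [leaf])

    tree = Node("Sum", [term()])
    while pos < len(lexemes) and lexemes[pos][0] == "+":
        plus = Node("+", text=lexemes[pos][1])
        pos += 1
        # Left-recursive production: the old Sum becomes the first child.
        tree = Node("Sum", [tree, plus, term()])
    if pos != len(lexemes):
        raise SyntaxError(f"unexpected lexeme {pos}")
    return tree
```

The root's symbol is always `Sum`, which was known in advance; what the caller actually wants is the tree hanging off it.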
In any event, the two-level parse involves feeding the output non-terminals of the lexical level as terminals for the syntactic level.