
Bison error recovery for automatic semicolon insertion

I'm trying to write a Bison C++ parser for JavaScript files, but I can't figure out how to make the semicolon optional.

According to the ECMAScript 2018 specification ( https://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf , section 11.9), the semicolon isn't actually optional; instead, it is inserted automatically during parsing. The specification states:

When, as the source text is parsed from left to right, a token (called the offending token) is encountered that is not allowed by any production of the grammar, then a semicolon is automatically inserted before the offending token if one or more of the following conditions is true:

  • The offending token is separated from the previous token by at least one LineTerminator[...]

Based on this, I'm trying to solve the problem in this naive way:

  • Detect the error, using the error special token;
  • In the action, tell the lexer that a syntax error occurred; if it encountered a newline character before the current token, the lexer returns a new semicolon token at the next yylex call, and at the call after that it returns the token that was the offending one when the syntax error occurred.

A very simplified version of my parser looks like this:

program:
   stmt_list END
;

stmt_list:
    %empty
 |  stmt_list stmt
 |  stmt_list error  { /* error detected; tell the lexer about the syntax error */ }
;

stmt:
    value SEMICOLON
|   [other types of statements...]
;

value:
    NUMBER
|   STRING
;

But done this way, if the file contains a valid JavaScript statement that ends with a newline character instead of a semicolon, then when the offending token is encountered the parser reduces the rest of the statement into the error special token. By the time I tell the lexer about the syntax error, the parser has already reduced the error token into stmt_list and the previous valid statement is lost, making the semicolon insertion useless.

Obviously I don't want my parser to discard the valid statement and move on to the next one.

How can I make this possible? Is this the right approach or am I missing something?

I don't believe this approach is workable.

Just as a note, you would have to detect the error before any reduction takes place. So for semicolon insertion at the end of a statement, you need to add the error production to stmt, not stmt_list. So you would end up with something like this:

stmt_list
     :  %empty
     |  stmt_list stmt

stmt: value ';'   { handle_value_stmt(); }
    | value error { handle_value_stmt(); }
    | [other types of statements...]

That doesn't insert a semicolon; it just pretends that the semicolon was inserted. (If a semicolon couldn't be inserted, then another error will be triggered.)

But since it doesn't involve the lexer, it will happen whether or not the missing semicolon was at the end of a line, which is too enthusiastic. So the ideal solution would be to somehow tell the lexer to generate a semicolon token as the next token. But at the point where the error is detected, the lexer has already produced the lookahead token, and the parser knows what the lookahead token is. And it will use its recorded lookahead token to continue the parse.

There's also the question of how it is possible to communicate with the lexer at this point, since Mid-Rule Actions don't really play well with the error recovery algorithm. In theory, you could use the fact that yyerror will be called to report the error, but that means that yyerror needs to be able to deduce that this is not a "real" error, which means it will have to go poking into yyparse's guts. (I'm sure this is possible but I don't know how to do it off the top of my head, and it doesn't seem to me to be recommendable.)

Now, in theory it is possible to tell the parser to discard the lookahead token, and to tell the lexer to generate a semicolon followed by a repeat of the token it just sent. So it is just barely possible that by piling hack onto hack, you could make this work, if you're stubborn enough. But you'd end up with something very difficult to maintain, verify and test. (And making sure that it works in all corner cases will also be a challenge.)

And that's without looking at the other cases where semicolons could be inserted.
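To make the lexer half of that hack concrete, here is a minimal C++ sketch of a token source that can replay the last token behind a synthetic semicolon. The names Token, ReplayingLexer and request_semicolon are hypothetical and not part of Bison. The sketch covers only the lexer side: the parser would still have to discard its recorded lookahead (Bison's C interface documents the yyclearin macro for that), and the problem of calling request_semicolon at the right moment is left unsolved.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical token type: a kind plus whether a newline preceded it.
struct Token {
    std::string kind;
    bool newline_before;
};

// Wraps a raw token source and can replay the last token after a ";".
class ReplayingLexer {
public:
    explicit ReplayingLexer(std::function<Token()> source)
        : source_(std::move(source)) {}

    // Return queued tokens first, otherwise pull from the underlying source.
    Token next() {
        if (!pending_.empty()) {
            Token t = pending_.front();
            pending_.pop_front();
            return t;
        }
        last_ = source_();
        have_last_ = true;
        return last_;
    }

    // To be called when the parser rejects the token most recently pulled
    // from the source. Queues ";" followed by that token if a newline
    // preceded it; returns false when no insertion applies, in which case
    // the syntax error is genuine.
    bool request_semicolon() {
        if (!have_last_ || !last_.newline_before) return false;
        assert(pending_.empty());      // a replay must not already be queued
        last_.newline_before = false;  // never insert twice before one token
        pending_.push_back(Token{";", false});
        pending_.push_back(last_);
        return true;
    }

private:
    std::function<Token()> source_;
    std::deque<Token> pending_;
    Token last_{};
    bool have_last_ = false;
};
```

Clearing newline_before on the replayed token means a second rejection of the same token is reported as a real error rather than triggering another insertion.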

My approach to ASI was to simply analyse the grammar by figuring out which pairs of consecutive tokens are possible. (That's easy to do; you just need to construct FIRST and LAST sets, and then read through all the productions looking at consecutive symbols.) Then if the input consists of token A followed by one or more newlines followed by token B, and it is not possible for A to be followed by B in the grammar, then that's a candidate for semicolon insertion. The semicolon insertion might fail, but that will generate a syntax error, so you can't get a false positive. (You might have to fix the syntax error message, but at that point you at least know that you've inserted a semicolon.)
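As a rough illustration of that analysis, the following C++ sketch computes the allowed token pairs for a grammar given as a list of productions and then filters a token stream. It is a sketch under assumptions, not the code I actually used: Production, Token, adjacent_pairs and insert_semicolons are hypothetical names, any symbol that never appears on a left-hand side is treated as a terminal, and only the newline condition quoted in the question is handled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Symbol = std::string;
using SymbolSet = std::set<Symbol>;
using PairSet = std::set<std::pair<Symbol, Symbol>>;

struct Production {
    Symbol lhs;
    std::vector<Symbol> rhs;  // empty for an %empty production
};

// Copy every element of src into dst; report whether dst grew.
static bool add_all(SymbolSet& dst, const SymbolSet& src) {
    bool grew = false;
    for (const Symbol& s : src) grew |= dst.insert(s).second;
    return grew;
}

// Compute all terminal pairs (a, b) such that b can directly follow a.
PairSet adjacent_pairs(const std::vector<Production>& grammar) {
    std::set<Symbol> nonterminals, nullable;
    std::map<Symbol, SymbolSet> first, last;
    for (const Production& p : grammar) nonterminals.insert(p.lhs);
    for (const Production& p : grammar)
        for (const Symbol& s : p.rhs)
            if (!nonterminals.count(s)) { first[s] = {s}; last[s] = {s}; }

    // Iterate to a fixed point for nullable, FIRST and LAST.
    for (bool changed = true; changed;) {
        changed = false;
        for (const Production& p : grammar) {
            bool all_nullable = true;
            for (const Symbol& s : p.rhs)
                all_nullable = all_nullable && nullable.count(s);
            if (all_nullable) changed |= nullable.insert(p.lhs).second;
            for (auto it = p.rhs.begin(); it != p.rhs.end(); ++it) {
                changed |= add_all(first[p.lhs], first[*it]);
                if (!nullable.count(*it)) break;
            }
            for (auto it = p.rhs.rbegin(); it != p.rhs.rend(); ++it) {
                changed |= add_all(last[p.lhs], last[*it]);
                if (!nullable.count(*it)) break;
            }
        }
    }

    // Read through the productions looking at consecutive symbols,
    // skipping over nullable symbols in between.
    PairSet pairs;
    for (const Production& p : grammar)
        for (std::size_t i = 0; i < p.rhs.size(); ++i)
            for (std::size_t j = i + 1; j < p.rhs.size(); ++j) {
                for (const Symbol& a : last[p.rhs[i]])
                    for (const Symbol& b : first[p.rhs[j]])
                        pairs.insert(std::make_pair(a, b));
                if (!nullable.count(p.rhs[j])) break;
            }
    return pairs;
}

// Hypothetical token type: a kind plus whether a newline preceded it.
struct Token {
    Symbol kind;
    bool newline_before;
};

// Insert ";" before any token that follows a newline and cannot directly
// follow the previous token according to the allowed pairs.
std::vector<Token> insert_semicolons(const std::vector<Token>& in,
                                     const PairSet& allowed) {
    std::vector<Token> out;
    for (const Token& tok : in) {
        assert(!tok.kind.empty());  // every token needs a kind
        if (!out.empty() && tok.newline_before &&
            allowed.count(std::make_pair(out.back().kind, tok.kind)) == 0)
            out.push_back(Token{";", false});
        out.push_back(tok);
    }
    return out;
}
```

For the simplified grammar in the question, the allowed pairs come out as NUMBER or STRING followed by ';', and ';' followed by NUMBER, STRING or END, so a NUMBER that starts a new line directly after another NUMBER gets a semicolon inserted in front of it.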

Proving that that algorithm works is trickier, because it could theoretically be the case that A could be followed by B in some context but that it is not possible in the current context, while A ; B would be possible in the current context. In that case, you might miss a possible semicolon insertion. I haven't looked in detail at recent JS versions, but long ago when I wrote a JS lexer, I managed to prove to my own satisfaction that there are no such cases.


Note: since the question was raised in a comment, I'll add a little hand-waving, although I really don't recommend following this approach.

Without diving into bison's guts, it's really not possible to "unshift" a token, including the error token (which is a real token, more or less). By the time the error token has been shifted, the parse is effectively committed to an error production. So if you want to annul the error, you have to accept that fact and work around it.

After an error token has been shifted, the parser will then skip tokens until a shiftable token is encountered. So if you've managed to insert an automatic semicolon into the token stream, you can use that token as a guard:

    stmt: value ';'       { handle_value_stmt(); }
        | value error ';' { handle_value_stmt(); }

However, you might not have managed to insert an automatic semicolon, in which case you really need to report the syntax error (and maybe attempt to resynchronise). The above rules would just silently drop tokens up to the next semicolon, which is certainly wrong. So a first approximation would be for your ASI inserter to always insert something, which can be used as a guard in the error productions:

    stmt: value ';'       { handle_value_stmt(); }
        | value error ';' { handle_value_stmt(); }
        | value error NO_ASI { handle_real_error(); }

That's sufficient for "abort on error" processing, but if you want to do error recovery, you'll need to do some more hackery.

As I said, I really don't recommend going down this route. The end result won't be pretty, even if it works (and you still might find that code which you thought worked fails on real user input, in a case you didn't consider).
