
PLY.yacc error handling using parser resynchronization

I'm trying to implement user-friendly syntax error handling during parsing. From what I've seen in the official PLY documentation, one way is to raise an exception when the first SyntaxError occurs and terminate the parsing. However, I would rather use the parser resynchronization technique that the documentation suggests.

The documentation says:

The most well-behaved approach for handling syntax errors is to write grammar rules that include the error token. For example, suppose your language had a grammar rule for a print statement like this:

def p_statement_print(p):
    'statement : PRINT expr SEMI'
    ...

To account for the possibility of a bad expression, you might write an additional grammar rule like this:

def p_statement_print_error(p):
    'statement : PRINT error SEMI'
    print("Syntax error in print statement. Bad expression")

I have a grammar excerpt like this:

def p_operation(self, p) -> None:
    '''
    operation : unaryOperation
              | binaryOperation
    '''

def p_unaryOperation(self, p) -> None:
    '''
    unaryOperation : unaryOperation L_SQUARE_BRACKET projection R_SQUARE_BRACKET
                   | RELATION_NAME
    '''

def p_projection(self, p) -> None:
    '''
    projection : multipleAttributes
               | attribute
    '''

def p_multipleAttributes(self, p) -> None:
    '''
    multipleAttributes : projection COMMA attribute
    '''

def p_attribute(self, p) -> None:
    '''
    attribute : ATTRIBUTE
    '''

I'm quite unsure how I should define such new rules including the error token. Should I replace every non-terminal with the error token?

Looking forward to seeing your replies! Thanks a lot for your help.

You definitely should not add an error production for every non-terminal.

Resynchronisation works when there is some token which would normally reset the parsing context to a known state. In languages with a clear end-of-statement marker (a semicolon in the example you cite), that token works well as a resynchronisation point. Discarding text up to the next semicolon and then parsing from there won't work 100% of the time, but it does work in many cases.
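As a rough illustration of discarding up to the next semicolon, here is a library-independent sketch. It assumes a toy language whose only statement is PRINT NUMBER SEMI and tokens given as (kind, value) pairs; the function name and token layout are made up for this example, and it is not a description of how PLY processes its error token internally.

```python
def parse_statements(tokens):
    """Parse 'PRINT NUMBER SEMI' statements from (kind, value) pairs.

    On a malformed statement, record where it started and discard tokens
    up to and including the next SEMI before resuming (panic-mode recovery).
    """
    values, error_positions = [], []
    i = 0
    while i < len(tokens):
        kinds = [kind for kind, _ in tokens[i:i + 3]]
        if kinds == ["PRINT", "NUMBER", "SEMI"]:
            values.append(tokens[i + 1][1])
            i += 3
            continue
        # Syntax error: remember the start of the bad statement ...
        error_positions.append(i)
        # ... then skip ahead to the synchronisation token.
        while i < len(tokens) and tokens[i][0] != "SEMI":
            i += 1
        i += 1  # step past the SEMI (or past the end of the input)
    return values, error_positions


tokens = [
    ("PRINT", None), ("NUMBER", 1), ("SEMI", None),
    ("PRINT", None), ("COMMA", None), ("COMMA", None), ("SEMI", None),
    ("PRINT", None), ("NUMBER", 3), ("SEMI", None),
]
values, error_positions = parse_statements(tokens)
```

The second statement is reported once, at token index 3, and the third statement is still parsed, which is the behaviour the semicolon buys you.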

Parentheses and brackets can also be used as resynchronisation points, but the heuristic is not as reliable, because many syntax errors are the result of mismatched parentheses or brackets. Scanning for a missing close bracket could discard the entire input, for example.
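The failure mode is easy to see in a small sketch. The helper below is hypothetical (it is not part of PLY) and works on a plain list of token kinds, borrowing the token names from the grammar in the question:

```python
def discard_until(kinds, start, sync_kind):
    """Skip from index `start` to just past the first `sync_kind`.

    Returns (next_index, discarded_kinds, found). If the sync token never
    appears, everything from `start` to the end of the input is discarded.
    """
    i = start
    while i < len(kinds) and kinds[i] != sync_kind:
        i += 1
    found = i < len(kinds)
    next_index = i + 1 if found else len(kinds)
    return next_index, kinds[start:i], found


# Error inside well-matched brackets: only the bad token is lost.
matched = ["RELATION_NAME", "L_SQUARE_BRACKET", "COMMA", "R_SQUARE_BRACKET",
           "L_SQUARE_BRACKET", "ATTRIBUTE", "R_SQUARE_BRACKET"]
matched_result = discard_until(matched, 2, "R_SQUARE_BRACKET")

# The close bracket itself is missing: the rest of the input is thrown away.
unclosed = ["RELATION_NAME", "L_SQUARE_BRACKET", "ATTRIBUTE", "COMMA",
            "COMMA", "ATTRIBUTE"]
unclosed_result = discard_until(unclosed, 4, "R_SQUARE_BRACKET")
```

In the second case `found` is false and nothing is left to parse, so a recovery rule keyed on the close bracket cannot report any later errors.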

Resynchronisation is more complicated in languages without clear statement delimiters, including languages like Python where a newline only terminates a statement if it is not nested within parentheses. Discarding up to a newline might work, but you might have to deal with feedback between the scanner and the parser which determines when a newline is transmitted as a token and when it is skipped as whitespace.
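To make the newline problem concrete, here is a simplified sketch in which the scanner tracks parenthesis depth on its own and only emits NEWLINE at depth zero. This is an assumption made to keep the example short; a real implementation may need the parser to tell the scanner about nesting. The token names and regular expression are invented for the example.

```python
import re

# One named group per token kind, so m.lastgroup names the kind that matched.
TOKEN_RE = re.compile(
    r"(?P<NEWLINE>\n)|(?P<LPAREN>\()|(?P<RPAREN>\))"
    r"|(?P<SPACE>[ \t]+)|(?P<WORD>[^\s()]+)"
)


def token_kinds(text):
    """Return token kinds, dropping newlines that occur inside parentheses."""
    depth = 0
    kinds = []
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if kind == "SPACE":
            continue
        if kind == "LPAREN":
            depth += 1
        elif kind == "RPAREN":
            depth = max(depth - 1, 0)
        elif kind == "NEWLINE" and depth > 0:
            continue  # nested newline: treated as whitespace
        kinds.append(kind)
    return kinds


balanced = token_kinds("a (b\nc)\nd")
unbalanced = token_kinds("f(a\nb")
```

With the unbalanced input no NEWLINE token is ever produced, so a recovery strategy that discards up to the next newline has nothing to stop at. That is the kind of scanner state you may have to reset during recovery.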

Inconsistent indentation can be a useful resynchronisation trigger, with a couple of caveats. First, you must not reject valid input with "misleading" indentation, so the trigger needs to be more sensitive during resynchronisation than during normal parsing. Second, tracking inconsistent indentation definitely requires a parser->scanner back-channel. So it's more work than simple panic recovery, but it can be effective.

The bottom line is that there are few, if any, universal algorithms for good error reporting and recovery. You need to base your strategy on the syntactic nature of the language.

Ideally, you will want to refine the code by examining your parser's response to common errors, but that can't really be done until you have an actual deployment and can see what the common errors are. So the best advice I can give is to start with a simple recovery strategy and see how it does with different syntax errors, particularly the syntax errors you accidentally created (or those of your friends and collaborators). Keep an archive of different syntax errors encountered, which you can use to test improvements to your diagnosis and recovery code. Don't expect it to be perfect, since it is a difficult problem, but do try to make it more accurate whenever you can.
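One possible shape for such an archive is a list of inputs paired with the number of errors you expect to be reported, replayed after every change to the recovery code. In the sketch below, `count_errors` is a hypothetical stand-in for your real parser (it only recognises statements of the form `print <digits>`), and the archive entries are invented examples.

```python
import re


def count_errors(text):
    """Stand-in parser: count ';'-separated statements that are not 'print <digits>'."""
    statements = [s.strip() for s in text.split(";") if s.strip()]
    return sum(1 for s in statements if not re.fullmatch(r"print \d+", s))


# Each entry: (input text, number of syntax errors that should be reported).
ERROR_ARCHIVE = [
    ("print 1; print 2;", 0),
    ("print ; print 2;", 1),
    ("print 1 2; prnt 3;", 2),
]


def run_archive(parse, archive):
    """Return (text, expected, actual) for every case where the counts differ."""
    failures = []
    for text, expected in archive:
        actual = parse(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


failures = run_archive(count_errors, ERROR_ARCHIVE)
```

An empty `failures` list means the current recovery code still handles every archived case. Comparing counts is deliberately crude; you could just as well store the expected messages or positions.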
