
Python Lex-Yacc (PLY) Error recovery at the end of input

Problem

I am trying to implement an error-tolerant parser using Python Lex-Yacc (PLY), but I have trouble using error recovery rules at the end of my input string.

How can I recover from an unexpected end of input?

Example

This example grammar produces strings of the form A END A END A END A END ...

Statement   : Expressions

Expressions : Expression Expressions
            | 

Expression  : A END

I want to perform error recovery if the END token was omitted, so strings like AAA END or AAA will be recognized by the parser.

My approach

I added an error recovery rule:

Expression : A END
           | A error

This allows me to accept the following input: AAA END

But if the last END token is omitted (AAA), I still get a syntax error and cannot recover.


Sample PLY code

from __future__ import print_function

# Tokens
tokens = ('A', 'END')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expression expressions'''
    p[0] = [p[1]] + p[2]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END
                  | A error'''
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s'" % p.value)
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    yacc.parse(s)
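
For reference, a session with this script behaves roughly as follows (an illustrative transcript based on the behaviour described above, not verbatim output; PLY also suppresses some cascading error messages):

query > AAA END
Syntax error at 'A'
parsed: ['A', 'A', 'A']
query > A
Syntax error at EOI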

I am adding this as a new answer (and do know it is too late for the bounty :-( ) because it is a very different approach. If we used flex, it would be much easier, since it has the notion of the <<EOF>> token that matches only at end of file. After thinking about that, I realized that it was very simple to add that functionality to PLY without any change to the original module, by using a proxy around the lexer. And Python allows easy implementation of proxies thanks to the __getattr__ special method.

I just add:

  • a new token EOF that will be sent at end of file
  • a proxy around the token method of the lexer that, at end of file, returns the special EOF token on the first pass and then the normal None
  • the EOF token at the end of the statement rule

And I still reverse the rule to expressions : expressions expression instead of expressions : expression expressions to allow an immediate reduce (with left recursion, the parser can reduce each expression as soon as it is complete instead of stacking the whole input first).

The code becomes:

from __future__ import print_function

# Tokens
tokens = ('A', 'END', 'EOF')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex

orig_lexer = lex.lex()

class ProxyLexer(object):
    def __init__(self, lexer, eoftoken):
        self.end = False
        self.lexer = lexer
        self.eof = eoftoken
    def token(self):
        tok = self.lexer.token()
        if tok is None:
            if self.end:
                # The EOF token has already been emitted: reset the flag so
                # the proxy can be reused for the next parse, and return None.
                self.end = False
            else:
                # First call at end of input: synthesize the special EOF token.
                self.end = True
                tok = lex.LexToken()
                tok.type = self.eof
                tok.value = None
                tok.lexpos = self.lexer.lexpos
                tok.lineno = self.lexer.lineno
        # print('custom', tok)
        return tok
    def __getattr__(self, name):
        # Delegate all other attribute accesses to the wrapped lexer.
        return getattr(self.lexer, name)

lexer = ProxyLexer(orig_lexer, 'EOF')

# Rules
def p_statement_expr(p):
    '''statement : expressions EOF'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expressions expression'''
    p[0] = p[1] + [p[2]]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END
                  | A error'''
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s'" % p.value)
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
parser = yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    parser.parse(s, lexer = lexer)

That way:

  • the original grammar is unchanged
  • the error recovery method remains stupidly simple and has no dependence on the rest of the grammar
  • it can easily be extended to complex parsers
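
For instance, the parser and lexer defined above can be reused across several inputs, since the proxy resets its end flag after emitting EOF. A minimal usage sketch:

# Minimal usage sketch: the same ProxyLexer instance serves several parses.
for s in ('A END A END', 'AAA END', 'AAA'):
    print('input :', s)
    parser.parse(s, lexer=lexer)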

As you want to accept all elements, you can explicitly declare a rule for an A not followed by an END, and use the fact that yacc and PLY deal gracefully with ambiguous rules.

You can simply have a normal rule:

Expression : A END

and, below it, a lower-priority rule (as it comes later) that will issue a warning:

Expression : A

That way, all A elements will be accepted, there won't be any syntax error, and a warning will be issued for any A not followed by an END, including one at the end of the input. In order to find the offending A more easily, I have added the position of the symbol in the input to the warning.

Edit:

The script has been modified to correctly deal with other syntax errors (such as AENDENDAEND), and also to immediately reduce expressions by replacing expressions : expression expressions with expressions : expressions expression.

Here is the modified script (tested in Python 3.4, simply replacing raw_input with input):

from __future__ import print_function

# Tokens
tokens = ('A', 'END')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expressions expression'''
    p[0] = p[1] + [p[2]]

def p_expressions_err(p):
    '''expressions : expressions error'''
    p[0] = p[1]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END'''
    p[0] = 'A'

# add a separate rule BELOW previous one to display a warning
def p_expression_pharse_warn(p):
    '''expression : A'''
    print("Warning at absolute position %d (line %d)" % (p.lexpos(1), p.lineno(1)))
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s'" % p.value)
    else:
        print("Syntax error at EOI")


import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    yacc.parse(s)
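
With this version, an input such as AAA should produce one warning per A that lacks its END and still parse. An illustrative transcript (not verbatim output):

query > AAA
Warning at absolute position 0 (line 1)
Warning at absolute position 1 (line 1)
Warning at absolute position 2 (line 1)
parsed: ['A', 'A', 'A']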

Edit: the following is an incorrect attempt to avoid an additional rule. It is more complex and less efficient than the version above; please see my conclusion below.

Edit per comment:

I understand your point that you do not want to multiply grammar rules. It is possible to be fault tolerant, except for the last token: if your last token is in error, it will not be followed by anything and will never be caught by the rule expression : A error.

But here is a fault-tolerant parser that keeps everything except the last token in case of an error on that one:

from __future__ import print_function

# Tokens
tokens = ('A', 'END')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    # print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expressions expression'''
    p[0] = p[1] + [p[2]]
    result.append(p[2])

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END
                  | A error'''
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s' (%d)" % (p.value, p.lexpos))
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = input('query > ')   # use raw_input() on Python 2
    except EOFError:
        break
    result = []
    yacc.parse(s)
    print('Result', result)

The principle is to collate with expressions : expressions expression instead of expressions : expression expressions, and to keep everything in a global variable.

With an input of A END AA END AAA END it gives

Result ['A', 'A', 'A', 'A', 'A', 'A']

and with A END AA END AAA (the last END omitted), it gives

Result ['A', 'A', 'A', 'A', 'A']

(all tokens but the last)

With a true flex-bison solution, it would be possible to make use of the special <<EOF>> token that matches at end of input, so that there is always another token after the last one. Unfortunately, it is not implemented in PLY, and the only real solution is to introduce a rule that accepts a lone A token. For a real parser, it also guarantees that you are actually processing the correct token; I used

def p_expression_pharse(p):
    '''expression : A END'''
    p[0] = 1 + p.lexpos(1)

# add a separate rule BELOW previous one to display a warning
def p_expression_pharse_warn(p):
    '''expression : A'''
    print("Warning at absolute position %d (line %d)" % (p.lexpos(1), p.lineno(1)))
    p[0] = -1 - p.lexpos(1)

to uniquely identify tokens in the result, and I get correct positions.
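
A hypothetical post-processing helper (not part of the original answer) could decode that signed encoding back into a position and a validity flag:

# Hypothetical helper: decode the p[0] values produced by the two rules above,
# where a valid 'A END' yields 1 + lexpos and a lone 'A' yields -1 - lexpos.
def decode(code):
    if code > 0:
        return code - 1, True     # position of a correctly terminated A
    return -code - 1, False       # position of an A missing its END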

And ... the error processing is very simple ...

Discussion (TL;DR):

I admit I missed the point of last-token error recovery. It is because in all the parsers I have seen in real use cases, error recovery consists of rejecting the part that is syntactically incorrect (and thus not directly usable) and re-synchronizing the parser on the next correct group of tokens. In all that I have seen, if a partial sentence can be used, it must not be processed by the error recovery mechanism but by a grammar rule, in which it is easy to describe the appropriate action.

If you just want to keep the offending input for later processing, I think it is not a problem of an action depending on the syntax, and I would simply note the position of the offending token, or at most note the position of the last correctly analysed token (the end of a complete element) and the beginning of the first error-recovery token, and say that what lies in between is incorrect.

But that would be much different from what is asked here ...

This works for all examples I could imagine:

from __future__ import print_function

# Tokens
tokens = ('A', 'END')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expression expressions'''
    p[0] = p[1] + p[2]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END'''
    p[0] = ['A']

def p_expression_error(p):
    '''expression : A error'''
    p[0] = ['A']
    if p[2] is not None:
        p[0] += p[2]

def p_error(p):
    if p is None:
        # End of input reached while an expression is still open: synthesize
        # an 'error' token so that the rule 'expression : A error' can match.
        print("Syntax error at EOI")
        e = yacc.YaccSymbol()
        e.type = 'error'
        e.value = None
        yacc.errok()
        return e
    elif p.type == 'error':
        # Already looking at an error token: just reset the error state.
        yacc.errok()
        return
    elif hasattr(p, 'value'):
        # Normal case: report the offending token and hand its value back to
        # the parser inside a synthesized 'error' token, so that the input is
        # preserved by p_expression_error.
        print("Syntax error at '%s'" % p.value)
        e = yacc.YaccSymbol()
        e.type = 'error'
        e.value = p.value
        yacc.errok()
        return e

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    yacc.parse(s)
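
If the p_error trick works as the author intends, a run on the original failing input should now both report the errors and keep every A. An illustrative transcript (not verbatim output):

query > AAA
Syntax error at 'A'
Syntax error at EOI
parsed: ['A', 'A', 'A']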
