Python Lex-Yacc (PLY) Error recovery at the end of input
I am trying to implement an error-tolerant parser using Python Lex-Yacc (PLY), but I have trouble using error recovery rules at the end of my input string.
How can I recover from an unexpected end of input?
This example grammar produces strings of the form A END A END A END A END ...
Statement : Expressions
Expressions : Expression Expressions
|
Expression : A END
I want to perform an error recovery if the END token was omitted, so strings like AAA END or AAA will be recognized by the parser.
I added an error recovery rule, which allows me to accept input like AAA END
Expression : A END
| A error
Which allows me to accept the following input: AAA END
But if the last END token is omitted (AAA), I still get a syntax error and cannot recover.
from __future__ import print_function

# Tokens
tokens = ('A', 'END')
t_A = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expression expressions'''
    p[0] = [p[1]] + p[2]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END
                  | A error'''
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s'" % p.value)
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    yacc.parse(s)
I add it as a new answer (and do know it is too late for the bounty :-( ) because it is a very different approach. If we used flex, it would be much easier, since it has the notion of the <<EOF>> token that matches only at end of file. After thinking about that, I realized that it was very simple to add that functionality to PLY without any change to the original module, by using a proxy around the lexer. And Python allows easy implementation of proxies thanks to the __getattr__ special method.
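The delegation trick behind that proxy can be sketched in isolation. This is a minimal, hypothetical example (the Counter class is purely illustrative and not part of PLY): any attribute the proxy does not define itself is forwarded to the wrapped object.

```python
class Proxy(object):
    """Forward any attribute we do not define ourselves to the wrapped object."""
    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails,
        # so attributes defined on the proxy itself take precedence.
        return getattr(self._target, name)

class Counter(object):
    """Illustrative wrapped object."""
    def __init__(self):
        self.count = 0
    def bump(self):
        self.count += 1

p = Proxy(Counter())
p.bump()            # forwarded to the wrapped Counter
print(p.count)      # -> 1
```

The same mechanism lets the ProxyLexer override only token() while every other lexer attribute passes through untouched.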
I just add:

- a new token EOF that will be sent at end of file
- a proxy around the token method of the lexer that, at end of file, returns the special EOF token on the first pass and then the normal None
- an EOF token to end the statement rule
- and I still reverse the rule to expressions : expressions expression instead of expressions : expression expressions to allow an immediate reduce
The code becomes:
from __future__ import print_function

# Tokens
tokens = ('A', 'END', 'EOF')
t_A = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
orig_lexer = lex.lex()

class ProxyLexer(object):
    def __init__(self, lexer, eoftoken):
        self.end = False
        self.lexer = lexer
        self.eof = eoftoken

    def token(self):
        tok = self.lexer.token()
        if tok is None:
            if self.end:
                self.end = False
            else:
                self.end = True
                tok = lex.LexToken()
                tok.type = self.eof
                tok.value = None
                tok.lexpos = self.lexer.lexpos
                tok.lineno = self.lexer.lineno
        # print('custom', tok)
        return tok

    def __getattr__(self, name):
        return getattr(self.lexer, name)

lexer = ProxyLexer(orig_lexer, 'EOF')

# Rules
def p_statement_expr(p):
    '''statement : expressions EOF'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expressions expression'''
    p[0] = p[1] + [p[2]]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END
                  | A error'''
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s'" % p.value)
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
parser = yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    parser.parse(s, lexer=lexer)
That way:
As you want to accept all elements, you can explicitly declare a rule for an A not followed by an END, and use the fact that yacc and PLY deal nicely with ambiguous rules.
You can simply have a normal rule:
Expression : A END
and below it a lower-priority rule (as it comes later) that will issue a warning:
Expression : A
That way, all A will be accepted, there won't be any syntax error, and the warning will be issued for any A not followed by an END, including one at the end of the flow. In order to more easily find the offending A, I have added the position of the symbol in the flow to the warning.
Edit:
The script is modified to correctly deal with other syntax errors (such as AENDENDAEND), and also to immediately reduce expressions, by replacing expressions : expression expressions with expressions : expressions expression.
Here is the modified script (tested in Python 3.4, simply replacing raw_input with input):
from __future__ import print_function

# Tokens
tokens = ('A', 'END')
t_A = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expressions expression'''
    p[0] = p[1] + [p[2]]

def p_expressions_err(p):
    '''expressions : expressions error'''
    p[0] = p[1]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END'''
    p[0] = 'A'

# add a separate rule BELOW the previous one to display a warning
def p_expression_pharse_warn(p):
    '''expression : A'''
    print("Warning at absolute position %d (line %d)" % (p.lexpos(1), p.lineno(1)))
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s'" % p.value)
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    yacc.parse(s)
Edit: the following is an incorrect attempt to avoid an additional rule: it is more complex and less efficient than the above version. Please see my conclusion below.
Edit per comment:
I understand your point that you do not want to multiply grammar rules. It is possible to be fault tolerant, except for the last token. If your last token is in error, it will not be followed by anything and will never be caught by the rule expression : A error.
But here is a fault-tolerant parser that keeps everything except the last token in case of an error on that one:
from __future__ import print_function

# Tokens
tokens = ('A', 'END')
t_A = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    # print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expressions expression'''
    p[0] = p[1] + [p[2]]
    result.append(p[2])

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END
                  | A error'''
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s' (%d)" % (p.value, p.lexpos))
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = input('query > ')
    except EOFError:
        break
    result = []
    yacc.parse(s)
    print('Result', result)
The principle is to collate with expressions : expressions expression instead of expressions : expression expressions, and to keep everything in a global variable.
With an input of A END AA END AAA END it gives
Result ['A', 'A', 'A', 'A', 'A', 'A']
and with A END AA END AAA (no final END), it gives
Result ['A', 'A', 'A', 'A', 'A']
(all tokens but the last)
With a true flex-bison solution, it would be possible to make use of the special <<EOF>> token that matches at end of input, to always have another token after the last one. Unfortunately, it is not implemented in PLY, and the only real solution is to introduce a rule that accepts a lone A token. For a real parser, it also guarantees that you are actually processing the correct token: I used
def p_expression_pharse(p):
    '''expression : A END'''
    p[0] = 1 + p.lexpos(1)

# add a separate rule BELOW the previous one to display a warning
def p_expression_pharse_warn(p):
    '''expression : A'''
    print("Warning at absolute position %d (line %d)" % (p.lexpos(1), p.lineno(1)))
    p[0] = -1 - p.lexpos(1)
to uniquely identify tokens in the result string, and I get correct positions.
And ... the error processing is very simple ...
Discussion TL;DR:
I admit I missed the point of last-token error recovery. It is because in all the parsers I have seen in real use cases, error recovery consisted in rejecting the part that was syntactically incorrect (and thus not directly usable) and re-synchronizing the parser on the next correct group of tokens. In everything I have seen, if a partial sentence can be used, it must not be processed by the error recovery mechanism but by a grammar rule, in which it is easy to describe the appropriate action.
If you just want to keep the offending input for later processing, I think it is not a problem of an action depending on the syntax, and I would simply note the position of the offending token, or at most note the position of the last correctly analysed token (the end of a complete element) and the beginning of the first error recovery token, and say that what is between them is incorrect.
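That "just note the positions" idea can be sketched without any grammar changes. This is an illustrative sketch, not PLY code: the names error_spans and FakeToken are made up here, and only the signature of p_error (receiving a token or None) matches what PLY would call.

```python
error_spans = []  # (lexpos, value) of each offending token; lexpos None at end of input

def p_error(p):
    """Error hook in the style of PLY's p_error: record the position, don't recover."""
    if p is None:
        error_spans.append((None, '<end of input>'))
    else:
        error_spans.append((p.lexpos, p.value))

# Simulated tokens standing in for what PLY would pass to p_error:
class FakeToken(object):
    def __init__(self, lexpos, value):
        self.lexpos = lexpos
        self.value = value

p_error(FakeToken(7, 'A'))   # an offending token at position 7
p_error(None)                # an unexpected end of input
print(error_spans)           # -> [(7, 'A'), (None, '<end of input>')]
```

The caller can then report or re-process the recorded spans after parsing, instead of encoding the recovery in grammar rules.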
But it would be much different from what is asked here ...
This works for all the examples I could imagine:
from __future__ import print_function

# Tokens
tokens = ('A', 'END')
t_A = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expression expressions'''
    p[0] = p[1] + p[2]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END'''
    p[0] = ['A']

def p_expression_error(p):
    '''expression : A error'''
    p[0] = ['A']
    if p[2] is not None:
        p[0] += p[2]

def p_error(p):
    if p is None:
        print("Syntax error at EOI")
        e = yacc.YaccSymbol()
        e.type = 'error'
        e.value = None
        yacc.errok()
        return e
    elif p.type == 'error':
        yacc.errok()
        return
    elif hasattr(p, 'value'):
        print("Syntax error at '%s'" % p.value)
        e = yacc.YaccSymbol()
        e.type = 'error'
        e.value = p.value
        yacc.errok()
        return e

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    yacc.parse(s)