Syntax error using empty production rule in Python PLY (lex/yacc)

The full example is given here:

import ply.lex as lex
import Property
# List of token names.   This is always required
tokens = [
    'CheckupInformation',
    'Introduction',
    'Information',
    'perfect',
    'sick',
    'LPAREN',
    'RPAREN',
    'CHAR',
    'NUMBER'
    ] 
def t_CheckupInformation(t)     : 'CheckupInformation'     ; return t
def t_Introduction(t)  : 'Introduction'  ; return t
def t_Information(t) : 'Information' ; return t
def t_perfect(t): 'perfect'; return t
def t_sick(t) : 'sick'; return t



t_LPAREN  = r'\('
t_RPAREN  = r'\)'
t_CHAR = r'[a-zA-Z_][a-zA-Z0-9_\-]*'
t_ignore = " \t"
# Define a rule so we can track line numbers

def t_NUMBER(t):
    r'[+\-0-9_][0-9_]*'
    t.lexer.lineno += len(t.value)
    try:
        t.value = int(t.value)
    except ValueError:
        print("Integer value too large %s" % t.value)
        t.value = 0
    return t
def t_SEMICOLON(t):
    r'\;.*'
    t.lexer.lineno += len(t.value)
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)
# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

 # Build the lexer
lexer = lex.lex()
# define upper level classes first     
class stat:
    def __init__(self):
        self.statement = ""
        self.intro = list()
        self.body = list()


P=stat()
def p_stat(p):
    'Stat : LPAREN CheckupInformation statIntro statBody RPAREN'
    p[0]=(p[1],p[2],p[3],p[4],p[5])

def p_Intro(p) : 
    '''statIntro : LPAREN Introduction Name RPAREN
                 | statIntro LPAREN Introduction Name RPAREN
                 | empty'''

    if len(p)==5:
       p[0] = (p[3])
    elif len(p)==6:
       p[0] = (p[4])
    else:
       p[0]= None
    P.intro.append(p[0])

def p_Name(p):
    'Name : CHAR'
    p[0]=p[1]



def p_Body(p):
    '''statBody : LPAREN Information bodyinfo RPAREN
                | statBody LPAREN Information bodyinfo RPAREN'''
    if len(p)==5:
       p[0] = (p[3])
    elif len(p)==6:
       p[0] = (p[4])
    P.body.append(p[0])
def p_bodyinfo(p):
    '''bodyinfo : LPAREN CHAR perfect RPAREN
                | LPAREN CHAR sick RPAREN'''
    p[0]=p[2],p[3]


def p_empty(p):
    'empty :  '
    print("This function is called")
    pass   
def p_error(p):
    print("Syntax error in input '%s'!" % p.value)

import ply.yacc as yacc
parser = yacc.yacc()
import sys
if len(sys.argv) < 2 :
    sys.exit("Usage: %s <filename>" % sys.argv[0])
fp = open(sys.argv[1])
contents=fp.read()
result=parser.parse(contents)

print("(CheckupInformation")
if (P.intro) != None:
    for x in range(len(P.intro)):
        print("    (Introduction %s)" %(P.intro[x]))
for x in range(len(P.body)):
        print("    (Information( %s %s))" %(P.body[x]))
print(")")

The text File1 is:

(CheckupInformation
  (Introduction John)
  (Introduction Patt)
  (Information(Anonymous1 perfect))
  (Information(Anonymous2 sick))
)

The text File2 is:

(CheckupInformation
  (Information(Anonymous1 perfect))
  (Information(Anonymous2 sick))
)

According to my BNF grammar, 'Intro' is optional. The code above works well with File1. But when I remove the 'Intro' part, as in File2, it gives a syntax error at the 'body' section, which suggests that it cannot handle the empty production. How should I handle the empty production rule in my code?

Error message: error message snip

Your program cannot be run as posted because it imports Property, which is not a standard library module. Deleting that line is at least sufficient to get to the point where Ply attempts to build the parser, at which point it generates several warnings, including a shift/reduce conflict warning. This last warning is important; you should attempt to fix it, and you certainly should not ignore it. (Which means that you should report it as part of your question.) It is this conflict which is preventing your parser from working.

Here's what it says:

WARNING: Token 'NUMBER' defined, but not used
WARNING: There is 1 unused token
Generating LALR tables
WARNING: 1 shift/reduce conflict

Ply generates the file parser.out, which includes more information about your grammar, including a detailed description of the shift/reduce conflict. Examining that file, we find the following:

state 3

    (1) Stat -> LPAREN CheckupInformation . statIntro statBody RPAREN
    (2) statIntro -> . LPAREN Introduction Name RPAREN
    (3) statIntro -> . statIntro LPAREN Introduction Name RPAREN
    (4) statIntro -> . empty
    (10) empty -> .

  ! shift/reduce conflict for LPAREN resolved as shift
    LPAREN          shift and go to state 4

  ! LPAREN          [ reduce using rule 10 (empty -> .) ]

    statIntro                      shift and go to state 5
    empty                          shift and go to state 6

The parser enters State 3 when it is at this point in the processing of Stat:

Stat -> LPAREN CheckupInformation . statIntro statBody RPAREN

The dot between CheckupInformation and statIntro indicates the progress. There might be more than one production in a state with dots in the middle, which means that the parser has not yet had to figure out which of those alternatives to pick. There are also productions with the dot at the beginning; these correspond to the non-terminal(s) which immediately follow the dot(s), indicating that those productions now need to be considered.

There may also be productions with the dot at the end, which indicates that at this point in the parse, the sequence of symbols encountered can be "reduced" to the corresponding non-terminal. In other words, the parser has recognised that non-terminal.

Reductions must be performed when they are recognised, or not at all. A reduction might not be performed if the following token (the "lookahead token") cannot follow the non-terminal to be reduced. In general, the parser needs to consider the following questions, which can be immediately answered by consulting the state transition table (shown immediately following the productions):

  1. Can the parse progress by continuing with the next token, without performing a reduction? (This is called a "shift" action, because the parser shifts one token to the right in the active production(s).)
  2. For each possible reduction in this state, can the parse progress by performing that reduction?

A conflict occurs if the answer to more than one of these questions is "yes". That doesn't necessarily mean that the grammar is ambiguous, but it does mean that the parser cannot decide how to choose between the alternatives.

Like most parser generators, Ply resolves this question using some built-in rules. One of these rules is that in the absence of other information (precedence declarations), if the answer to the first question was "yes", the parser should proceed without performing any reductions.

In the particular example of this state, the reduction which could be made is empty -> . (rule 10 in the listing above). Although it's not obvious from this state (we'd have to look at the state the parser enters after doing that reduction, which is State 6), after reducing empty, the parser's next move will necessarily be to reduce statIntro -> empty, after which it will go to State 5, which includes the production

Stat -> LPAREN CheckupInformation statIntro . statBody RPAREN

In order for that sequence to be valid, the parser needs to know that it will be able to progress, which means that the lookahead token (in this case, ( ) must be a possible input in State 5. Of course it is, because statBody can start with an open parenthesis. So the reduction could be taken.

But statIntro could also begin with a ( , so the parser does not have to do the reduction in order to progress. Given those two alternatives, Ply chooses the shift action, which means that it discards the possibility that statIntro could be empty and assumes that the ( belongs to a statIntro. If there is a statIntro, this is the correct choice. But if statIntro is missing, the ( belongs to statBody, and the reduction should have been taken.

So that's the problem with your grammar. It's an indication that the grammar, as written, needs more than one token of lookahead. Unfortunately, many parser generators, including Ply, do not have a mechanism to cope with grammars which need more than one lookahead token. (If there is some limit to the amount of lookahead needed, as there is in this case, where the conflict could be resolved by looking at the next two tokens, then it is theoretically possible to find an equivalent grammar for the same language which needs only one lookahead token. But that will have to be your responsibility, because Ply won't do it for you.)
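
To make the "two tokens" remark concrete, here is a small sketch that does not use Ply at all. It splits abbreviated versions of the two test inputs with a throwaway regular expression (my own simplification, not the lexer from the question) and prints what the parser would see at the decision point right after ( CheckupInformation:

```python
import re

# Simplified tokenizer for illustration only: parentheses and words.
TOKEN_RE = re.compile(r"\(|\)|[A-Za-z_][A-Za-z0-9_\-]*")

def tokens(text):
    """Split text into a flat list of token strings."""
    return TOKEN_RE.findall(text)

# Abbreviated versions of File1 and File2 from the question.
file1 = "(CheckupInformation (Introduction John) (Information(Anonymous1 perfect)))"
file2 = "(CheckupInformation (Information(Anonymous1 perfect)))"

for name, text in (("File1", file1), ("File2", file2)):
    toks = tokens(text)
    # toks[0:2] is the "( CheckupInformation" header; State 3 is reached
    # immediately after it, so toks[2] is the single lookahead token.
    print(name, "one token:", toks[2:3], "two tokens:", toks[2:4])
```

With one token of lookahead both inputs look identical (an open parenthesis); only the second token tells Introduction apart from Information.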

In this case, the solution is extremely simple. It is only necessary to remove the empty production from statIntro, and instead make it optional by providing two productions for Stat, one which has a statIntro and one which doesn't:

def p_stat_1(p):
    'Stat : LPAREN CheckupInformation statIntro statBody RPAREN'
    p[0]=(p[1],p[2],p[3],p[4],p[5])

def p_stat_2(p):
    'Stat : LPAREN CheckupInformation           statBody RPAREN'
    p[0]=(p[1],p[2],None,p[3],p[4])

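def p_Intro(p) :
    '''statIntro : LPAREN Introduction Name RPAREN
                 | statIntro LPAREN Introduction Name RPAREN
    '''
    # Action as in the original, minus the branch for the empty alternative
    if len(p)==5:
       p[0] = (p[3])
    elif len(p)==6:
       p[0] = (p[4])
    P.intro.append(p[0])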

(I also removed p_empty from the grammar.)

This modified grammar does not produce any conflicts, and will parse your test inputs as expected:

$ python3 learner.py learner.1
(CheckupInformation
    (Introduction John)
    (Introduction Patt)
    (Information( Anonymous1 perfect))
    (Information( Anonymous2 sick))
)
$ python3 learner.py learner.2
(CheckupInformation
    (Information( Anonymous1 perfect))
    (Information( Anonymous2 sick))
)

Postscript:

The transformation suggested above is simple and will work in a large number of cases, not just cases where the conflict can be resolved with a lookahead of two tokens. But, as noted in a comment, it does increase the size of a grammar, particularly when productions have more than one optional component. For example, the production:

A : optional_B optional_C optional_D

would have to be expanded into seven different productions, one for each non-empty selection from B, C and D, and in addition each place where A was used would need to be duplicated to allow for the case where A was empty.
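
If you want to see the seven alternatives spelled out, the following sketch enumerates them. It is only an illustration of the counting argument; the helper name expand_optionals is made up, and nothing here is part of Ply:

```python
from itertools import combinations

def expand_optionals(components):
    """Return every non-empty selection of components, keeping their order."""
    result = []
    for size in range(len(components), 0, -1):
        # combinations() preserves the input order, so B always precedes C, etc.
        for combo in combinations(components, size):
            result.append(list(combo))
    return result

productions = expand_optionals(["B", "C", "D"])
for rhs in productions:
    print("A :", " ".join(rhs))
print(len(productions), "productions")
```

For n optional components this produces 2**n - 1 right-hand sides, which is why the approach gets unwieldy quickly.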

That seems like a lot, but it might be that not all of these productions are necessary. The transformation is only needed if there is an overlap between the set of terminals which can start the optional component and the set of symbols which can follow it. So, for example, if B, C and D can all start with a parenthesis but A cannot be followed by a parenthesis, then optional_D will not cause a conflict, and only B and C would need to be expanded:

A : B C optional_D
  |   C optional_D
  | B   optional_D
  |     optional_D

That requires a bit of grammatical analysis to figure out what can follow A, but in common grammars that's not too hard to do by hand.
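
If you would rather not do that analysis by hand, it can be sketched in a few lines. The code below is not Ply's algorithm; it is a textbook-style fixed-point computation of FIRST and FOLLOW sets over a hypothetical toy grammar (the names S, optB, LBRACKET and so on are invented for the example), and it reports which optional components overlap with what can follow them:

```python
# Hypothetical toy grammar: A appears only between brackets, so it can never
# be followed by a parenthesis, while B, C and D all start with one.
GRAMMAR = {
    "S": [["LBRACKET", "A", "RBRACKET"]],
    "A": [["optB", "optC", "optD"]],
    "optB": [["B"], []],   # [] is the empty alternative
    "optC": [["C"], []],
    "optD": [["D"], []],
    "B": [["LPAREN", "b", "RPAREN"]],
    "C": [["LPAREN", "c", "RPAREN"]],
    "D": [["LPAREN", "d", "RPAREN"]],
}

def analyse(grammar, start):
    """Compute FIRST, FOLLOW and the nullable set by fixed-point iteration."""
    first = {nt: set() for nt in grammar}
    follow = {nt: set() for nt in grammar}
    nullable = set()
    follow[start].add("$end")

    def first_of(symbols):
        # FIRST of a sequence of symbols, and whether the sequence can be empty
        result = set()
        for sym in symbols:
            if sym not in grammar:       # terminal
                result.add(sym)
                return result, False
            result |= first[sym]
            if sym not in nullable:
                return result, False
        return result, True

    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                seq_first, seq_empty = first_of(rhs)
                if not seq_first <= first[nt]:
                    first[nt] |= seq_first
                    changed = True
                if seq_empty and nt not in nullable:
                    nullable.add(nt)
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue
                    # Whatever can start the rest of the rule can follow sym;
                    # if the rest can be empty, so can whatever follows nt.
                    rest_first, rest_empty = first_of(rhs[i + 1:])
                    new = rest_first | (follow[nt] if rest_empty else set())
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return first, follow, nullable

first, follow, nullable = analyse(GRAMMAR, "S")
needs_expansion = {}
for optional, component in (("optB", "B"), ("optC", "C"), ("optD", "D")):
    # An overlap means the parser cannot tell "component present" from "absent".
    needs_expansion[optional] = bool(first[component] & follow[optional])
    print(optional, "needs expanding:", needs_expansion[optional])
```

In this toy grammar only optB and optC overlap with their followers, which matches the four-production expansion shown above.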

If that still seems like too many productions, there are a couple of other possibilities which are less general, but which might help anyway.

First, you might decide that it doesn't really matter what order B, C and D are presented in the above production. In that case, you could replace

A : optional_B optional_C optional_D

with the not-too-complicated, somewhat more accepting alternative:

A : 
  | A X
X : B | C | D

(or you could avoid X by writing out the alternatives individually in productions of A.)

That allows multiple uses of B, C and D, but that appears to correspond to your grammar, in which the optional components are actually possibly empty repetitions.

That leaves the problem of how to produce a reasonable AST, but that's fairly easy to solve, at least in the context of Ply. Here's one possible practical solution, again assuming that repetition is acceptable:

# This solution assumes that A cannot be followed by
# any token which might appear at the start of a component
def p_A(p):
    """ A : A1 """
    # Create the list of lists B, C, D, as in the original
    p[0] = [ p[1]["B"], p[1]["C"], p[1]["D"] ]

def p_A1_empty(p):
    """ A1 : """
    # Start with a dictionary of empty lists
    p[0] = { "B":[], "C":[], "D":[] }

def p_A1_B(p):
    """ A1 : A1 B """
    p[1]["B"].append(p[2])
    p[0] = p[1]

def p_A1_C(p):
    """ A1 : A1 C """
    p[1]["C"].append(p[2])
    p[0] = p[1]

def p_A1_D(p):
    """ A1 : A1 D """
    p[1]["D"].append(p[2])
    p[0] = p[1]

You could simplify the last three action functions if you arranged for the semantic values of B, C and D to include an indication of what they are. So if, for example, B returned ["B", value] instead of value, then you could combine the last three A1 actions into a single function:

def p_A1_BCD(p):
    """ A1 : A1 B
           | A1 C
           | A1 D
    """
    p[1][p[2][0]].append(p[2][1])
    p[0] = p[1]
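
To see what that combined action accumulates without building a parser, here is a plain-Python imitation of the three steps (empty A1, one tagged component at a time, final A). The function names are invented for the illustration; in the real grammar the work is done by the p_ functions above:

```python
def new_a1():
    """Mirror of the empty A1 production: a dictionary of empty lists."""
    return {"B": [], "C": [], "D": []}

def add_component(a1, tagged):
    """Mirror of 'A1 : A1 B | A1 C | A1 D' with tagged values like ["B", value]."""
    tag, value = tagged
    a1[tag].append(value)
    return a1

def finish_a(a1):
    """Mirror of 'A : A1': the list of lists B, C, D."""
    return [a1["B"], a1["C"], a1["D"]]

a1 = new_a1()
# Components may arrive in any order and may repeat.
for item in (["B", 1], ["D", "x"], ["B", 2]):
    a1 = add_component(a1, item)
result = finish_a(a1)
print(result)
```

Whatever order the components were written in, the result always comes out grouped as B, C, D.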

If none of that is satisfactory, and all of the conflicts can be resolved with one additional lookahead token, then you can try to solve the issue in the lexer.

For example, your language seems to consist entirely of S-expressions which start with an open parenthesis followed by some kind of keyword. So the lexical analyser could combine the open parenthesis with the following keyword into a single token. Once you do that, your optional components no longer start with the same token and the conflict vanishes. (This technique is often used to parse XML inputs, which have the same issue: if everything interesting starts with a < , then conflicts abound. But if you recognise <tag as a single token, then the conflicts disappear. In the case of XML, whitespace is not allowed between the < and the tagname; if your grammar does allow whitespace between the ( and the following keyword, your lexer patterns will become slightly more complicated. But it's still manageable.)
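
As a rough illustration of that idea, here is a sketch using only Python's re module rather than Ply. The token names OPEN_CHECKUP, OPEN_INTRO and OPEN_INFO are invented, the keywords perfect and sick are left as plain CHAR tokens for brevity, and the \s* in each pattern is what allows whitespace between the parenthesis and the keyword:

```python
import re

# Combined "open parenthesis + keyword" patterns come first so that they win
# over the plain LPAREN pattern. All token names here are hypothetical.
TOKEN_SPEC = [
    ("OPEN_CHECKUP", r"\(\s*CheckupInformation\b"),
    ("OPEN_INTRO",   r"\(\s*Introduction\b"),
    ("OPEN_INFO",    r"\(\s*Information\b"),
    ("LPAREN",       r"\("),
    ("RPAREN",       r"\)"),
    ("CHAR",         r"[a-zA-Z_][a-zA-Z0-9_\-]*"),
    ("SKIP",         r"\s+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Return the token names found in text, ignoring whitespace."""
    return [m.lastgroup for m in MASTER.finditer(text) if m.lastgroup != "SKIP"]

# Abbreviated versions of File1 and File2 from the question.
file1 = "(CheckupInformation (Introduction John) (Information(Anonymous1 perfect)))"
file2 = "(CheckupInformation (Information(Anonymous1 perfect)))"
print(tokenize(file1)[:2])
print(tokenize(file2)[:2])
```

After the header token, the two inputs now begin with different tokens, so a single token of lookahead is enough to decide whether a statIntro is present. The same patterns could presumably be carried over to Ply token rules, but I have not tested that here.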
