Python Lex-Yacc (PLY): Not recognizing start of line or start of string

Question

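I am very new to PLY and only a little more than a beginner in Python. I am experimenting with PLY-3.4 and Python 2.7 to learn it. Please see the code below. I am trying to create a token QTAG, which is a string made of zero or more whitespace characters, followed by 'Q' or 'q', followed by '.', a positive integer, and one or more whitespace characters. For example, valid QTAGs are: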

"Q.11 "
"  Q.12 "
"q.13     "
'''
   Q.14 
'''

Invalid ones are:

"asdf Q.15 "
"Q.  15 "

Here is my code:

import ply.lex as lex

class LqbLexer:
    # List of token names.   This is always required
    tokens = [
        'QTAG',
        'INT'
        ]


    # Regular expression rules for simple tokens

    def t_QTAG(self,t):
        r'^[ \t]*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)
        return t


    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    t_ignore  = ' \t'

    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)

    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
             tok = self.lexer.token()
             if not tok: break
             print tok

# test it
q = LqbLexer()
q.build()
#VALID inputs
q.test("Q.11 ")
q.test("  Q.12 ")
q.test("q.13     ")
q.test('''
   Q.14 
''')
# INVALID ones are
q.test("asdf Q.15 ")
q.test("Q.  15 ")

The output I get is as follows:

LexToken(QTAG,11,1,0)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,12,1,4)
LexToken(QTAG,13,1,0)
Newline found
Illegal character 'Q'
Illegal character '.'
LexToken(INT,14,2,6)
Newline found
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'f'
Illegal character 'Q'
Illegal character '.'
LexToken(INT,15,3,7)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,15,3,4)

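Notice that only the first and third of the valid inputs were tokenized correctly. I cannot figure out why my other valid inputs are not tokenized properly. In the docstring of t_QTAG:

Replacing '^' with '\A' did not work.
I tried removing '^'. Then all the valid inputs get tokenized, but the second invalid input also gets tokenized.

Any help is greatly appreciated!

Thanks

PS: I joined the google-group ply-hack and tried posting there, but I could not post either directly in the forum or through email. I am not sure whether the group is still active. Prof. Beazley is not responding either. Any ideas?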

Answer 1

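I finally found the answer myself. Posting it here so that others may find it useful.

As @Tadgh correctly pointed out, t_ignore = ' \t' consumes the spaces and tabs, so the t_QTAG regex above never gets to match the leading whitespace, and as a result the second valid input is not tokenized. After reading the PLY documentation carefully, I learned that if the order of the token regexes has to be preserved, they must be defined as functions rather than as strings, as was done for t_ignore. If strings are used, PLY sorts them by decreasing regex length and adds them after the functions. I am guessing that t_ignore is special and is somehow applied before anything else; this part is not clearly documented. The workaround is to define a function for a new token (e.g. t_SPACETAB) after t_QTAG that simply does not return anything. With this, all the valid inputs are now tokenized correctly, except the triple-quoted one (the multi-line string containing "Q.14"). Also, the invalid ones are not tokenized, as required.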
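To illustrate why the skipped whitespace matters, here is a minimal sketch that uses only the standard re module, not PLY itself. It assumes the lexer hands the compiled pattern the whole input together with a starting offset; the offset 2 below stands in for "two ignored spaces already skipped" and is my own illustration, not something taken from the PLY source.

```python
import re

# The original QTAG pattern from the question.
QTAG = re.compile(r'^[ \t]*[Qq]\.[0-9]+\s+')

data = "  Q.12 "

# Matching from offset 0: '^' is satisfied and the pattern itself
# consumes the two leading spaces.
from_start = QTAG.match(data, 0)

# Matching from offset 2 (assumed: the leading spaces were already
# skipped as ignored characters). '^' still refers to the real start
# of the string, so the match fails.
after_skip = QTAG.match(data, 2)

print(from_start is not None)
print(after_skip is not None)
```

If that assumption about the offset holds, it would also explain why swapping '^' for '\A' changed nothing: both anchors refer to the true beginning of the input, not to the position where matching resumes.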

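Multi-line string problem: it turns out that PLY uses the re module internally. In that module, by default, ^ is interpreted only at the beginning of the string and NOT at the beginning of every line. To change that behavior, I need to turn on the multi-line flag, which can be done inside the regex with (?m). So, to properly process all the valid and invalid strings in my test, the correct regex is: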

r'(?m)^\s*[Qq]\.[0-9]+\s+'
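The effect of the flag can be checked outside PLY with a short sketch based on the re module alone. The sample text and the offsets are assumptions made for illustration: they mimic a lexer that has already consumed the first line and resumes matching right after a newline.

```python
import re

plain = re.compile(r'^\s*[Qq]\.[0-9]+\s+')      # '^' only at start of string
multi = re.compile(r'(?m)^\s*[Qq]\.[0-9]+\s+')  # '^' also after each newline

text = "dfhg\n   Q.15 asda\n"
line_start = text.index("\n") + 1  # offset of the second line

# Without the multi-line flag, '^' does not match at the start of line 2.
plain_hit = plain.match(text, line_start)

# With (?m), '^' matches right after the newline, so the tag is found.
multi_hit = multi.match(text, line_start)

# In the middle of a line, '^' still fails, so "asdf Q.16 " stays invalid.
mid_line = multi.match("asdf Q.16 ", 4)

print(plain_hit is None)
print(repr(multi_hit.group()))
print(int(multi_hit.group().strip()[2:]))
print(mid_line is None)
```

The last check is the reason for keeping the anchor instead of dropping it: a tag that follows other text on the same line is still rejected.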

Here is the corrected code, with some more tests added:

import ply.lex as lex

class LqbLexer:
    # List of token names.   This is always required

    tokens = [
        'QTAG',
        'INT',
        'SPACETAB'
        ]


    # Regular expression rules for simple tokens

    def t_QTAG(self,t):
        # corrected regex
        r'(?m)^\s*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)    
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    # Instead of t_ignore  = ' \t'
    def t_SPACETAB(self,t):
        r'[ \t]+'
        print "Space(s) and/or tab(s)"

    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)

    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
             tok = self.lexer.token()
             if not tok: break
             print tok

# test it
q = LqbLexer()
q.build()
print "-============Testing some VALID inputs===========-"
q.test("Q.11 ")
q.test("  Q.12 ")
q.test("q.13     ")
q.test("""


   Q.14
""")
q.test("""

qewr
dhdhg
dfhg
   Q.15 asda

""")

# INVALID ones are
print "-============Testing some INVALID inputs===========-"
q.test("asdf Q.16 ")
q.test("Q.  17 ")

Here is the output:

-============Testing some VALID inputs===========-
LexToken(QTAG,11,1,0)
LexToken(QTAG,12,1,0)
LexToken(QTAG,13,1,0)
LexToken(QTAG,14,1,0)
Newline found
Illegal character 'q'
Illegal character 'e'
Illegal character 'w'
Illegal character 'r'
Newline found
Illegal character 'd'
Illegal character 'h'
Illegal character 'd'
Illegal character 'h'
Illegal character 'g'
Newline found
Illegal character 'd'
Illegal character 'f'
Illegal character 'h'
Illegal character 'g'
Newline found
LexToken(QTAG,15,6,18)
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'a'
Newline found
-============Testing some INVALID inputs===========-
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'f'
Space(s) and/or tab(s)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,16,8,7)
Space(s) and/or tab(s)
Illegal character 'Q'
Illegal character '.'
Space(s) and/or tab(s)
LexToken(INT,17,8,4)
Space(s) and/or tab(s)
