Lexical analysis

Question

I am learning about lexers in Python. I am using the Ply library for lexical analysis of some strings, and I have implemented the following lexical analyzer for a subset of C++ syntax.

However, I am seeing strange behavior. When I define the COMMENT state functions after the other function definitions, the code works fine. If I define the COMMENT state functions before the other definitions, I get errors as soon as the // section of the input string starts.

What is the reason behind that?

import ply.lex as lex

tokens = (
    'DLANGLE',      # <<
    'DRANGLE',      # >>
    'EQUAL',        # =
    'STRING',       # "144"
    'WORD',         # 'Welcome' in "Welcome."
    'SEMICOLON',    # ;
)

t_ignore = ' \t\v\r'  # shortcut for whitespace

states = (
    ('cppcomment', 'exclusive'),   # // up to the end of the line
)


def t_cppcomment(t):  # definition here causes errors
    r'//'
    print('MyCOm:', t.value)
    t.lexer.begin('cppcomment')


def t_cppcomment_end(t):
    r'\n'
    t.lexer.begin('INITIAL')


def t_cppcomment_error(t):
    print("Error FOUND")
    t.lexer.skip(1)


def t_DLANGLE(t):
    r'<<'
    print('MyLAN:', t.value)
    return t


def t_DRANGLE(t):
    r'>>'
    return t


def t_SEMICOLON(t):
    r';'
    print('MySemi:', t.value)
    return t


def t_EQUAL(t):
    r'='
    return t


def t_STRING(t):
    r'"[^"]*"'
    t.value = t.value[1:-1]  # drop the surrounding quotes
    print('MyString:', t.value)
    return t


def t_WORD(t):
    r'[^ <>\n]+'
    print('MyWord:', t.value)
    return t


webpage = "cout<<\"Hello World\"; // this comment"
htmllexer = lex.lex()
htmllexer.input(webpage)
while True:
    tok = htmllexer.token()
    if not tok:
        break
    print(tok)

Regards

Answer 1

Just figured it out. Because I defined the comment state as exclusive, the lexer does not use the rules of the other (inclusive) state while it is in the comment state. That is what happens when the comment rules are defined at the top; otherwise it uses them for some reason. So you have to redefine every rule you need for the comment state. This is why ply provides error() rules, which skip characters for which no specific rule is defined.
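The "for some reason" part is probably a rule-ordering effect rather than anything to do with states. The PLY documentation says function rules are tried in the order they are defined, and the first pattern that matches wins. So when t_WORD comes before t_cppcomment, the // is most likely consumed as a WORD and the comment state is never entered at all. Here is a minimal sketch with the standard re module, assuming the rules behave like one big alternation; first_match and the two tuples are illustrative names, not part of PLY:

```python
import re

# Patterns copied from the question; the names mirror its rules.
WORD = ('WORD', r'[^ <>\n]+')
COMMENT = ('cppcomment', r'//')


def first_match(rules, text):
    """Return the name of the first alternative that matches at position 0."""
    master = re.compile('|'.join('(?P<%s>%s)' % rule for rule in rules))
    m = master.match(text)
    return m.lastgroup if m else None


# Comment rule defined last: WORD swallows '//' and the state never starts.
print(first_match([WORD, COMMENT], '// this comment'))
# Comment rule defined first: '//' is recognised and the state is entered.
print(first_match([COMMENT, WORD], '// this comment'))
```

With the comment rule first, the lexer does enter the exclusive state, and every character of the comment then lands in t_cppcomment_error because that state has no rule for it.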

Answer 2

It's because you have no rules in the comment state that accept "this" or "comment". Since you really don't care about what's in the comment, you can easily do something like:

t_cppcomment_ANYTHING = '[^\r\n]'

just below your t_ignore rule.
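Putting the two answers together, here is a sketch of how the comment state could be completed. It assumes Python 3 with PLY installed. The t_cppcomment_body rule and the tokenize helper are names made up for this example. A string rule named ANYTHING would, I believe, also have to be listed in tokens, so the sketch uses a function rule that returns nothing instead, which PLY treats as "discard this text".

```python
import ply.lex as lex

tokens = ('DLANGLE', 'DRANGLE', 'EQUAL', 'STRING', 'WORD', 'SEMICOLON')

states = (('cppcomment', 'exclusive'),)

t_ignore = ' \t\v\r'
# An exclusive state does not inherit t_ignore, so it gets its own.
t_cppcomment_ignore = ' \t\v\r'


def t_cppcomment(t):
    r'//'
    # Defined before t_WORD so that '//' is not consumed as a WORD.
    t.lexer.begin('cppcomment')


def t_cppcomment_body(t):
    r'[^\n]+'
    # Hypothetical catch-all rule: returning nothing discards the text.


def t_cppcomment_end(t):
    r'\n'
    t.lexer.begin('INITIAL')


def t_cppcomment_error(t):
    t.lexer.skip(1)


def t_DLANGLE(t):
    r'<<'
    return t


def t_DRANGLE(t):
    r'>>'
    return t


def t_SEMICOLON(t):
    r';'
    return t


def t_EQUAL(t):
    r'='
    return t


def t_STRING(t):
    r'"[^"]*"'
    t.value = t.value[1:-1]  # drop the surrounding quotes
    return t


def t_WORD(t):
    r'[^ <>\n]+'
    return t


def t_error(t):
    t.lexer.skip(1)


lexer = lex.lex()


def tokenize(text):
    """Hypothetical helper: lex text and return (type, value) pairs."""
    lexer.begin('INITIAL')
    lexer.input(text)
    return [(tok.type, tok.value) for tok in lexer]


print(tokenize('cout<<"Hello World"; // this comment'))
```

The comment text never reaches the error rule here, so t_cppcomment_error is only a safety net.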
