Writing Sebesta's lexical analyzer in Python. Does not work for the last lexeme in the input file

I have to translate the lexical analyzer code in Sebesta's Concepts of Programming Languages (chapter 4, section 2) to Python. Here's what I have so far:

# Character classes #
LETTER = 0
DIGIT = 1
UNKNOWN = 99

# Token Codes #
INT_LIT = 10
IDENT = 11
ASSIGN_OP = 20
ADD_OP= 21
SUB_OP = 22
MULT_OP = 23
DIV_OP = 24
LEFT_PAREN = 25
RIGHT_PAREN = 26

charClass = ''
lexeme = ''
lexLen = 0
token = ''
nextToken = ''

### lookup - function to lookup operators and parentheses ###
###          and return the token                         ###
def lookup(ch):
    def left_paren():
        addChar()
        globals()['nextToken'] = LEFT_PAREN

    def right_paren():
        addChar()
        globals()['nextToken'] = RIGHT_PAREN

    def add():
        addChar()
        globals()['nextToken'] = ADD_OP

    def subtract():
        addChar()
        globals()['nextToken'] = SUB_OP

    def multiply():
        addChar()
        globals()['nextToken'] = MULT_OP

    def divide():
        addChar()
        globals()['nextToken'] = DIV_OP
    options = {')': right_paren, '(': left_paren, '+': add,
               '-': subtract, '*': multiply , '/': divide}

    if ch in options.keys():
        options[ch]()
    else:
        addChar()

### addchar- a function to add next char to lexeme ###
def addChar():
    #lexeme = globals()['lexeme']
    if(len(globals()['lexeme']) <=98):
        globals()['lexeme'] += nextChar
    else:
        print("Error. Lexeme is too long")

### getChar- a function to get the next Character of input and determine its character class ###
def getChar():
    globals()['nextChar'] = globals()['contents'][0]
    if nextChar.isalpha():
        globals()['charClass'] = LETTER
    elif nextChar.isdigit():
        globals()['charClass'] = DIGIT
    else:
        globals()['charClass'] = UNKNOWN
    globals()['contents'] = globals()['contents'][1:]


## getNonBlank() - function to call getChar() until it returns a non whitespace character ##
def getNonBlank():
    while nextChar.isspace():
        getChar()

## lex- simple lexical analyzer for arithmetic functions ##
def lex():
    globals()['lexLen'] = 0
    getNonBlank()
    def letterfunc():
        globals()['lexeme'] = ''
        addChar()
        getChar()
        while(globals()['charClass'] == LETTER or globals()['charClass'] == DIGIT):
            addChar()
            getChar()
        globals()['nextToken'] = IDENT

    def digitfunc():
        globals()['lexeme'] = ''
        addChar()
        getChar()
        while(globals()['charClass'] == DIGIT):
            addChar()
            getChar()
        globals()['nextToken'] = INT_LIT

    def unknownfunc():
        globals()['lexeme'] = ''
        lookup(nextChar)
        getChar()

    lexDict = {LETTER: letterfunc, DIGIT: digitfunc, UNKNOWN: unknownfunc}
    if charClass in lexDict.keys():
        lexDict[charClass]()
    print('The next token is: '+ str(globals()['nextToken']) + ' The next lexeme is: ' + globals()['lexeme'])

with open('input.txt') as input:
    contents = input.read()
    getChar()
    lex()
    while contents:
        lex()

I'm using the string sum + 1 / 33 as my sample input string. From what I understand, the first call to getChar() at the top level sets charClass to LETTER and contents to um + 1 / 33.

The program then enters the while loop and calls lex(). lex() in turn accumulates the word sum into lexeme. When the while loop inside letterfunc encounters the first whitespace character, it breaks, exiting lex().

Since contents is not empty, the program goes through the while loop at the top level again. This time, the getNonBlank() call inside lex() throws out the spaces in contents and the same process as before is repeated.

Where I encounter an error is at the last lexeme. I'm told that globals()['contents'][0] is out of range when called by getChar(). I don't expect it to be a difficult error to find, but I've tried tracing it by hand and can't seem to spot the problem. Any help would be greatly appreciated.

It is simply because after successfully reading the last 3 of the input string, the digitfunc function calls getChar() one more time. But at that moment contents has been exhausted and is empty, so contents[0] reads past the end of the buffer, hence the error.
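
The failure can be reproduced in isolation. The snippet below is only a sketch: consume is a hypothetical stand-in for the indexing and slicing that getChar() performs, not code from the question. It also shows that slicing, unlike indexing, does not raise on an empty string.

```python
def consume(contents):
    """Mimic the original getChar(): index first, then slice (hypothetical helper)."""
    ch = contents[0]          # raises IndexError when contents == ''
    return ch, contents[1:]

# The getChar() call that fetches the final '3' of "33" empties the buffer.
ch, rest = consume("3")       # ch == '3', rest == ''

try:
    consume(rest)             # the extra lookahead call made by the while loop
    outcome = "no error"
except IndexError:
    outcome = "IndexError"

# Slicing is safe on an empty string and simply yields ''.
safe_peek = rest[:1]
print(outcome, repr(safe_peek))
```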

As a workaround, if you add a newline or a space after the last character of the expression, your current code does not exhibit the problem.

The reason is that when the last character is whitespace (class UNKNOWN), it is read as the lookahead character, lex returns, and the loop exits because contents is empty. If you are in the middle of a number or an identifier instead, you keep calling getChar() in a loop without testing for end of input. By the way, if your input string ends with a right parenthesis, your lexer eats it and forgets to display that it found it.

So you should at least:

  • test for end of input in getChar:

     def getChar():
         if len(contents) == 0:
             # print("END OF INPUT DETECTED")
             globals()['charClass'] = UNKNOWN
             globals()['nextChar'] = ''
             return
         ...

  • display the last token if one is left:

     ...
     while contents:
         lex()
     lex()

  • check that a lexeme is present before printing (weird things may happen at end of input):

     ...
     if charClass in lexDict.keys():
         lexDict[charClass]()
     if lexeme != '':
         print('The next token is: ' + str(globals()['nextToken']) + ' The next lexeme is: >' + globals()['lexeme'] + '<')

But your usage of globals is bad. The common idiom for using a global from within a function is to declare it before use:

a = 5

def setA(val):
    global a
    a = val   # sets the global variable a
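
Applied to the question's addChar, the idiom might look like the sketch below. MAX_LEXEME_LEN is a hypothetical name for the length limit the original checks with `<= 98`, and the driver loop at the bottom exists only for illustration. Only names that the function rebinds need a `global` declaration; names that are merely read do not.

```python
lexeme = ''
nextChar = ''
MAX_LEXEME_LEN = 99  # hypothetical name; the original tests len(lexeme) <= 98

def addChar():
    global lexeme                 # required: lexeme is rebound below
    if len(lexeme) < MAX_LEXEME_LEN:
        lexeme += nextChar        # nextChar is only read, so no declaration
    else:
        print("Error. Lexeme is too long")

# Illustrative driver: feed three characters through addChar().
for nextChar in "sum":
    addChar()

print(lexeme)
```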

But globals in Python are a code smell. The best you could do is to properly encapsulate your parser in a class. Objects are better than globals.
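
As a rough illustration of that suggestion, here is one way the lexer could be wrapped in a class. This is a sketch under assumptions, not the book's code or a definitive design: the Lexer class, its method names, the EOF character class, and the choice to report unrecognized characters with the UNKNOWN code are all introduced here for illustration. The token codes are the ones from the question, and the end-of-input check is folded into get_char.

```python
LETTER, DIGIT, UNKNOWN = 0, 1, 99
EOF = -1  # assumption: extra character class marking end of input

INT_LIT, IDENT = 10, 11
# Same operator codes as the question's lookup() function.
OPERATORS = {'+': 21, '-': 22, '*': 23, '/': 24, '(': 25, ')': 26}

class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.next_char = ''
        self.char_class = EOF
        self.get_char()           # prime the one-character lookahead

    def get_char(self):
        """Advance the lookahead; never index past the end of the text."""
        if self.pos >= len(self.text):
            self.next_char, self.char_class = '', EOF
            return
        self.next_char = self.text[self.pos]
        self.pos += 1
        if self.next_char.isalpha():
            self.char_class = LETTER
        elif self.next_char.isdigit():
            self.char_class = DIGIT
        else:
            self.char_class = UNKNOWN

    def lex(self):
        """Return (token_code, lexeme), or None at end of input."""
        while self.next_char.isspace():   # ''.isspace() is False, so this stops at EOF
            self.get_char()
        if self.char_class == EOF:
            return None
        lexeme = self.next_char
        if self.char_class == LETTER:
            self.get_char()
            while self.char_class in (LETTER, DIGIT):
                lexeme += self.next_char
                self.get_char()
            return IDENT, lexeme
        if self.char_class == DIGIT:
            self.get_char()
            while self.char_class == DIGIT:
                lexeme += self.next_char
                self.get_char()
            return INT_LIT, lexeme
        self.get_char()
        # Assumption: unrecognized characters are reported with the UNKNOWN code.
        return OPERATORS.get(lexeme, UNKNOWN), lexeme

    def tokens(self):
        """Yield tokens until the input, including the pending lookahead, is used up."""
        token = self.lex()
        while token is not None:
            yield token
            token = self.lex()

for code, lexeme in Lexer("sum + 1 / 33").tokens():
    print('The next token is: ' + str(code) + ' The next lexeme is: ' + lexeme)
```

Because the loop in tokens stops on the lexer's own end-of-input signal rather than on the remaining text being empty, a token that is still pending in the lookahead (such as a trailing right parenthesis) is not dropped.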
