
Lexical analyzer unable to recognize reserved words

I'm trying to develop a lexical analyzer which should tokenize and identify operators, identifiers, constants, reserved words, and data types in code read from an external text file. The problem is that I can't get it to identify reserved words or data types; it treats them as identifiers instead. I understand this is because they match the very first regex (a quick standalone check after the code below confirms this), but I still can't think of another way to recognize both identifiers/variables and reserved words. Any ideas?

import re                                 

tokens = []                               
sample_code = open("book.txt","r").read().split()


for word in sample_code:

   
    if re.match("[a-zA-Z]+", word):
        tokens.append([word,'is an Identifier'])

    
    elif re.match("([1-9][0-9]*)|0", word):
        if word[len(word) - 1] == ';': 
            tokens.append([word[:-1], "is Num Constant"])
            tokens.append([';', 'is a Semi-colon'])
        else: 
            tokens.append([word, "is a Num Constant"])
    
    
    elif word in ['str', 'int', 'bool','float','char']: 
        tokens.append([word, 'is a Datatype'])
    

    elif word in '><!*-/+%=':
        tokens.append([word, "is an Operator"])
    

    elif word in ['if','for','break','elif','else','while','then','call','do',
    'endwhile', 'return','void','static','case','throw','private', 'public']:
        tokens.append([word, "is a Reserved Word"])
    
    
print(tokens, '\n') 
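
For example, the identifier pattern on its own already matches every reserved word and datatype, so the later elif branches are never reached. A quick standalone check (separate from the program above):

import re

# The identifier pattern also matches every keyword, so in the program
# above the first branch always wins for words like 'int' or 'while'.
for w in ['int', 'while', 'return', 'myVariable']:
    print(w, '->', bool(re.match("[a-zA-Z]+", w)))   # prints True for all four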


Simply move your first regex check, the "Identifier" one, to the end:

import re
import numpy as np
import pandas as pd

tokens = []                               
sample_code = open("book.txt","r").read().split()

for word in sample_code:

    # Numeric constants (with optional trailing semicolon)
    if re.match("([1-9][0-9]*)|0", word):
        if word[-1] == ';':
            tokens.append([word[:-1], "is Num Constant"])
            tokens.append([';', 'is a Semi-colon'])
        else:
            tokens.append([word, "is a Num Constant"])

    # Datatypes
    elif word in ['str', 'int', 'bool', 'float', 'char']:
        tokens.append([word, 'is a Datatype'])

    # Operators
    elif word in '><!*-/+%=':
        tokens.append([word, "is an Operator"])

    # Reserved words
    elif word in ['if', 'for', 'break', 'elif', 'else', 'while', 'then', 'call', 'do',
                  'endwhile', 'return', 'void', 'static', 'case', 'throw', 'private', 'public']:
        tokens.append([word, "is a Reserved Word"])

    # Identifiers: checked last, so keywords and datatypes are matched first
    elif re.match("[a-zA-Z]+", word):
        if word[-1] == ';':
            tokens.append([word[:-1], "is an Identifier"])
            tokens.append([';', 'is a Semi-colon'])
        else:
            tokens.append([word, "is an Identifier"])

tokens = np.array(tokens)
tokens = pd.DataFrame(tokens, columns=['Token','Token_Type'])
# tokens.to_excel('tokens.xlsx', index=False)
print(tokens)

The output looks like this: [image: a DataFrame listing each Token with its Token_Type]
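
Checking the most specific categories first (exact keyword and datatype matches, then numeric constants) and the most general one last (any alphabetic word) is what makes this branch order work. The same idea can also be written as explicit set-membership tests with the identifier regex as the final fallback; a minimal sketch (the classify helper and the 'is unknown' label are illustrative, not part of the answer above):

import re

RESERVED = {'if', 'for', 'break', 'elif', 'else', 'while', 'then', 'call', 'do',
            'endwhile', 'return', 'void', 'static', 'case', 'throw', 'private', 'public'}
DATATYPES = {'str', 'int', 'bool', 'float', 'char'}

def classify(word):
    # Strip one trailing semicolon so "x;" and "int;" classify like "x" and "int".
    if word.endswith(';'):
        word = word[:-1]
    if word in DATATYPES:
        return word, 'is a Datatype'
    if word in RESERVED:
        return word, 'is a Reserved Word'
    if re.fullmatch("([1-9][0-9]*)|0", word):
        return word, 'is a Num Constant'
    if re.fullmatch("[a-zA-Z]+", word):
        return word, 'is an Identifier'
    return word, 'is unknown'

print(classify('int'))       # ('int', 'is a Datatype')
print(classify('counter;'))  # ('counter', 'is an Identifier')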
