Lexical analyzer unable to recognize reserved words
I'm trying to develop a lexical analyzer that tokenizes code from an external text file and classifies operators, identifiers, constants, reserved words, and data types. The problem is that it never recognizes reserved words or data types; it classifies them as identifiers instead. I understand this happens because those words match the very first regex, but I still can't think of a way to recognize both identifiers/variables and reserved words. Any ideas?
import re

tokens = []
sample_code = open("book.txt", "r").read().split()

for word in sample_code:
    if re.match("[a-zA-Z]+", word):
        tokens.append([word, 'is an Identifier'])
    elif re.match("([1-9][0-9]*)|0", word):
        if word[len(word) - 1] == ';':
            tokens.append([word[:-1], "is Num Constant"])
            tokens.append([';', 'is a Semi-colon'])
        else:
            tokens.append([word, "is a Num Constant"])
    elif word in ['str', 'int', 'bool', 'float', 'char']:
        tokens.append([word, 'is a Datatype'])
    elif word in '><!*-/+%=':
        tokens.append([word, "is an Operator"])
    elif word in ['if', 'for', 'break', 'elif', 'else', 'while', 'then', 'call', 'do',
                  'endwhile', 'return', 'void', 'static', 'case', 'throw', 'private', 'public']:
        tokens.append([word, "is a Reserved Word"])

print(tokens, '\n')
Simply move your first "Identifier" regex check to the end, so the keyword, data type, and operator checks run first:
import re
import numpy as np
import pandas as pd

tokens = []
sample_code = open("book.txt", "r").read().split()

for word in sample_code:
    if re.match("([1-9][0-9]*)|0", word):
        if word[len(word) - 1] == ';':
            tokens.append([word[:-1], "is Num Constant"])
            tokens.append([';', 'is a Semi-colon'])
        else:
            tokens.append([word, "is a Num Constant"])
    elif word in ['str', 'int', 'bool', 'float', 'char']:
        tokens.append([word, 'is a Datatype'])
    elif word in '><!*-/+%=':
        tokens.append([word, "is an Operator"])
    elif word in ['if', 'for', 'break', 'elif', 'else', 'while', 'then', 'call', 'do',
                  'endwhile', 'return', 'void', 'static', 'case', 'throw', 'private', 'public']:
        tokens.append([word, "is a Reserved Word"])
    # identifier check moved last, so it only fires when nothing else matched
    elif re.match("[a-zA-Z]+", word):
        if word[len(word) - 1] == ';':
            tokens.append([word[:-1], "is an Identifier"])
            tokens.append([';', 'is a Semi-colon'])
        else:
            tokens.append([word, "is an Identifier"])

tokens = np.array(tokens)
tokens = pd.DataFrame(tokens, columns=['Token', 'Token_Type'])
# tokens.to_excel('tokens.xlsx', index=False)
print(tokens)
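An alternative that avoids depending on branch order altogether is to match every word with one "name" pattern and then reclassify it against keyword/datatype sets, in the style of the tokenizer recipe in the `re` module documentation. This is a minimal sketch, not the asker's code: the names `TOKEN_SPEC`, `MASTER`, and `tokenize` are illustrative, and it scans a string directly rather than splitting a file on whitespace.

```python
import re

# Illustrative keyword/datatype sets taken from the question's lists.
KEYWORDS = {'if', 'for', 'break', 'elif', 'else', 'while', 'then', 'call',
            'do', 'endwhile', 'return', 'void', 'static', 'case', 'throw',
            'private', 'public'}
DATATYPES = {'str', 'int', 'bool', 'float', 'char'}

# One named group per token class; alternatives are tried left to right.
TOKEN_SPEC = [
    ('NUM_CONSTANT', r'(?:[1-9][0-9]*|0)'),
    ('NAME',         r'[A-Za-z]+'),      # keyword, datatype, or identifier
    ('OPERATOR',     r'[><!*\-/+%=]'),
    ('SEMICOLON',    r';'),
]
MASTER = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

def tokenize(text):
    tokens = []
    for m in MASTER.finditer(text):
        kind, value = m.lastgroup, m.group()
        if kind == 'NAME':
            # Reclassify after matching, so keywords can never be
            # shadowed by the identifier pattern.
            if value in KEYWORDS:
                kind = 'RESERVED_WORD'
            elif value in DATATYPES:
                kind = 'DATATYPE'
            else:
                kind = 'IDENTIFIER'
        tokens.append((value, kind))
    return tokens

print(tokenize('int x = 5;'))
# → [('int', 'DATATYPE'), ('x', 'IDENTIFIER'), ('=', 'OPERATOR'),
#    ('5', 'NUM_CONSTANT'), (';', 'SEMICOLON')]
```

A side benefit of `finditer` over `split()` is that `x=5;` tokenizes correctly even without spaces around the operator and semicolon.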