
Facing an issue while making a lexical analyzer for C++ code in Python

I'm trying to make a very simple lexical analyzer (tokenizer) for C++ code from scratch, without using PLY or any other library.

Things I've done so far:

  • Defined the keywords and operators in dictionaries.
  • Defined the regular expressions for comments, literals, etc.

What I'm stuck on:

Problem 1:

Now I'm trying to write a function check_line(line) which will consume a line of code and return the tokens in a dictionary. For example:

check_line('int main()')

The output should be:

Tokens = {'Keyword':'int', 'Keyword':'main', 'Opening parenthesis':'(','Closing Parenthesis':')'}

But the output I'm getting is:

Tokens = {'Keyword':'main', 'Keyword':'main', 'Opening parenthesis':'(','Closing Parenthesis':')'}

This is because main is overwriting int here: both are stored under the same 'Keyword' key.

Is there a way to tackle something like this?
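For illustration, here is a minimal sketch of why the dictionary loses tokens, together with one common workaround: a list of (category, lexeme) pairs, which preserves both duplicates and token order.

# Dict keys must be unique, so the second 'Keyword' entry
# silently replaces the first:
tokens = {}
tokens['Keyword'] = 'int'
tokens['Keyword'] = 'main'    # overwrites 'int'
print(tokens)                 # {'Keyword': 'main'}

# A list of (category, lexeme) pairs keeps duplicates and order:
tokens = []
tokens.append(('Keyword', 'int'))
tokens.append(('Keyword', 'main'))
print(tokens)                 # [('Keyword', 'int'), ('Keyword', 'main')]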

Problem 2:

When I pass check_line('int main()') to the function, the program doesn't match main, because the parentheses are attached to it. How can I tackle this?
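One possible direction, sketched here as an assumption rather than a full answer, is to split the line with a regular expression that separates word characters from punctuation, instead of line.split(' '):

import re

# Split into runs of word characters and individual punctuation marks.
# This pattern is illustrative; a real tokenizer needs more cases
# (multi-character operators, string literals, comments, ...).
words = re.findall(r'\w+|[^\w\s]', 'int main()')
print(words)   # ['int', 'main', '(', ')']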

I'm pasting the code I've written so far; please have a look and let me know what you think.

import re

# Keywords
keywords = ['const','float','int','struct','break',
            'continue','else','for','switch','void',
            'case','enum','sizeof','typedef','char',
            'do','if','return','union','while','new',
            'public','class','friend','main']


# Regular expression for identifiers
re_id = r'^[_]?[a-z]*[A-Z]([a-z]*[A-Z]*[0-9]+)'

# Regular expressions for literals
re_int_lit = r'^[+-]?[0-9]+'
re_float_lit = r'^[+-]?([0-9]*)\.[0-9]+'
re_string_lit = r'^"[a-zA-Z0-9_ ]+"$'

# Regular expressions for comments
re_singleline_comment = r'^//[a-zA-Z0-9 ]*'
re_multiline_comment = r'^/\*(.*?)\*/'

operators = {'=':'Assignment','-':'Subtraction',
             '+':'Addition','*':'Multiplication',
            '/':'Division','++':'increment',
            '--':'Decrement','||':'OR', '&&':'AND',
            '<<':'Cout operator','>>':'Cin Operator',
            ';':'End of statement'}

io = {'cin':'User Input',
      'cout':'User Output'} 

brackets = {'[':'Open Square',']':'Close Square',
           '{':'Open Curly','}':'Close Curly',
           '(':'Open Small',')':'Close Small'}


# Function

def check_line(line):
    tokens = {}
    words = line.split(' ')
    for word in words:
        if word in operators.keys():
            tokens['Operator ' + word] = word

        if word in keywords:
            # The same key 'Keywords' is used every time, so each new
            # keyword overwrites the previous one.
            tokens['Keywords'] = word
        
        if re.match(re_singleline_comment,word):
            break
       
    return tokens


check_line('int main()')

Output:

{'Keywords': 'main'}

The output should be:

Tokens = {'Keyword':'int', 'Keyword':'main', 'Opening parenthesis':'(','Closing Parenthesis':')'}

PS: I'm not done with the conditions yet; I'm just trying to fix this first.

A dictionary is a really bad choice of data structure for this function, since the essence of a dictionary is that each key is associated with exactly one corresponding value.

What a tokenizer should return is quite different: an ordered stream of token objects. In a simple implementation, that might be a list of tuples, but for any non-trivial application, you'll soon find that:

  1. Tokens are not just a syntactic type and a string. There's lots of important auxiliary information, most notably the location of the token in the input stream (for error messages).

  2. Tokens are almost always consumed in sequence, and there is no particular advantage in producing more than one at a time. In Python, a generator is a much more natural way of producing a stream of tokens (see the sketch after this list). If it were useful to create a list of tokens (for example, to implement a back-tracking parser), there would be no point working line by line, since line breaks are generally irrelevant in C++.
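As a sketch of what such a token object and generator might look like (the Token fields and the tokenize_demo name are illustrative assumptions, not code from the question):

from typing import Iterator, NamedTuple

class Token(NamedTuple):
    kind: str    # syntactic category, e.g. 'Keyword' or 'Open Small'
    text: str    # the matched lexeme
    pos: int     # offset in the input stream, for error messages

def tokenize_demo() -> Iterator[Token]:
    # A generator yields tokens one at a time; the parser pulls the
    # next token only when it needs it.
    yield Token('Keyword', 'int', 0)
    yield Token('Keyword', 'main', 4)
    yield Token('Open Small', '(', 8)
    yield Token('Close Small', ')', 9)

for token in tokenize_demo():
    print(token)

A real tokenizer would compute these fields while scanning the input, as in the matching loop sketched below.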

As noted in a comment, C++ tokens are not always separated by whitespace, as is evident in your example input. (main() is three tokens without containing a single space character.) The best way of splitting program text into a token stream is to repeatedly match token patterns at the current input cursor, return the longest match, and move the input cursor over the match.
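A minimal sketch of that loop, assuming a deliberately tiny pattern table that only covers the example input (a real lexer would need the full C++ token set):

import re

# Each entry is (category, compiled pattern).
patterns = [
    ('Keyword', re.compile(r'\b(?:int|float|void|return|main)\b')),
    ('Identifier', re.compile(r'[A-Za-z_]\w*')),
    ('Integer literal', re.compile(r'[0-9]+')),
    ('Punctuation', re.compile(r'[(){}\[\];,]')),
    ('Whitespace', re.compile(r'\s+')),
]

def tokenize(source):
    pos = 0
    while pos < len(source):
        best = None
        # Try every pattern at the current cursor; keep the longest match.
        for kind, pattern in patterns:
            m = pattern.match(source, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (kind, m)
        if best is None:
            raise SyntaxError(f'unexpected character {source[pos]!r} at {pos}')
        kind, m = best
        if kind != 'Whitespace':      # skip whitespace tokens
            yield (kind, m.group(), pos)
        pos = m.end()                 # move the cursor over the match

print(list(tokenize('int main()')))
# [('Keyword', 'int', 0), ('Keyword', 'main', 4),
#  ('Punctuation', '(', 8), ('Punctuation', ')', 9)]

Keeping the longest match implements the usual "maximal munch" rule, so, for example, '++' would win over two separate '+' tokens once both patterns are in the table.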


 