Problem writing a lexer for C++ code in Python

Question

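I am trying to write a very simple lexer (tokenizer) for C++ code from scratch, without using PLY or any other library.

What I have done so far:

Defined the keywords and operators in dictionaries.
Defined regular expressions for comments, literals, etc.

Where I am stuck:

Problem 1:

Now I am trying to write a function check_line(line) that consumes one line of code and returns the tokens in a dictionary. For example: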

check_line('int main()')

The output should be:

Tokens = {'Keyword':'int', 'Keyword':'main', 'Opening parenthesis':'(','Closing Parenthesis':')'}

But the output I get is:

Tokens = {'Keyword':'main', 'Keyword':'main', 'Opening parenthesis':'(','Closing Parenthesis':')'}

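because main overwrites int here.

Is there a way to solve this?

Problem 2:

When I pass check_line('int main()') to the function, the program does not match main because of the parentheses attached to it. How can I fix this?

I am pasting the code I have written so far; please take a look and let me know what you think.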

import re

# Keywords
keywords = ['const','float','int','struct','break',
            'continue','else','for','switch','void',
            'case','enum','sizeof','typedef','char',
            'do','if','return','union','while','new',
            'public','class','friend','main']


# Regular Expression for Identifiers
re_id = '^[_]?[a-z]*[A-Z]([a-z]*[A-Z]*[0-9]+)'

# Regular Expression for Literals
re_int_lit = '^[+-]?[0-9]+'
re_float_lit = r'^[+-]?([0-9]*)\.[0-9]+'
re_string_lit = '^"[a-zA-Z0-9_ ]+"$'

# Regular expression of Comments
re_singleline_comment = '^//[a-zA-Z0-9 ]*'
re_multiline_comment = '^/\\*(.*?)\\*/'

operators = {'=':'Assignment','-':'Subtraction',
             '+':'Addition','*':'Multiplication',
            '/':'Division','++':'increment',
            '--':'Decrement','||':'OR', '&&':'AND',
            '<<':'Cout operator','>>':'Cin Operator',
            ';':'End of statement'}

io = {'cin':'User Input',
      'cout':'User Output'} 

brackets = {'[':'Open Square',']':'Close Square',
           '{':'Open Curly','}':'Close Curly',
           '(':'Open Small',')':'Close Small'}


# Function

def check_line(line):
    tokens = {}
    words = line.split(' ')
    for word in words:
        if word in operators.keys():
            tokens['Operator ' + word] = word

        if word in keywords:
            tokens['Keywords'] = word
        
        if re.match(re_singleline_comment,word):
            break
       
    return tokens


check_line('int main()')

Output:

{'Keywords': 'main'}

The output should be:

Tokens = {'Keyword':'int', 'Keyword':'main', 'Opening parenthesis':'(','Closing Parenthesis':')'}

PS: I have not finished the conditions yet; I just want to solve this problem first.

Answer 1

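A dictionary is a very poor choice of data structure for this function, because the essence of a dictionary is that each key is associated with exactly one value.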
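To make the overwriting concrete, here is a tiny sketch (the variable names are only illustrative) showing that a repeated key keeps only its last value, while a list of tuples keeps every token in order:

```python
# A dict literal with a repeated key silently keeps only the last value.
tokens = {'Keyword': 'int', 'Keyword': 'main'}
print(tokens)

# A list of (type, text) tuples preserves both duplicates and ordering.
token_list = [('Keyword', 'int'), ('Keyword', 'main')]
print(token_list)
```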

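What a tokenizer should return is quite different: an ordered stream of token objects. In a simple implementation, that could be a list of tuples, but for any non-trivial application you will soon find that:

A token is not just a syntactic type and a string. There is a lot of important auxiliary information, most notably the position of the token in the input stream (for error messages).
Tokens are almost always consumed in sequence, and there is no particular advantage to producing more than one at a time. In Python, a generator is a much more natural way to produce a stream of tokens. If creating a list of tokens is useful (for example, to implement a backtracking parser), there is still no point in working line by line, since line breaks are generally irrelevant in C++.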

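As noted in the comments, C++ tokens are not always separated by whitespace, as your example input shows. (main() is three tokens and contains no whitespace at all.) The best way to split program text into a token stream is to repeatedly match token patterns at the current input cursor, return the longest match, and advance the input cursor past the match.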
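The approach described above could be sketched roughly as follows. This is only a minimal illustration, not a complete C++ lexer: the Token type, the KEYWORDS set, and the PATTERNS table are names and simplifications made up for this example, and the patterns cover only a small part of the language. Note that main is treated as an ordinary identifier here, since it is not a C++ keyword.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    line: int      # 1-based line of the token's first character
    column: int    # 1-based column of the token's first character


# Hypothetical, deliberately incomplete keyword set.
KEYWORDS = {'int', 'float', 'void', 'return', 'if', 'else', 'while', 'for'}

# Hypothetical pattern table; every pattern is tried at the cursor.
PATTERNS = [(kind, re.compile(rx)) for kind, rx in [
    ('Comment', r'//[^\n]*|/\*[\s\S]*?\*/'),
    ('Float', r'[0-9]*\.[0-9]+'),
    ('Integer', r'[0-9]+'),
    ('String', r'"(?:[^"\\\n]|\\.)*"'),
    ('Identifier', r'[A-Za-z_][A-Za-z0-9_]*'),
    ('Operator', r'\+\+|--|\|\||&&|<<|>>|[-+*/=;<>]'),
    ('Bracket', r'[\[\]{}()]'),
]]


def tokenize(text: str) -> Iterator[Token]:
    """Yield tokens one at a time, always taking the longest match."""
    pos, line, line_start = 0, 1, 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            # Whitespace separates tokens but is not a token itself.
            if ch == '\n':
                line += 1
                line_start = pos + 1
            pos += 1
            continue

        # Try every pattern at the cursor and keep the longest match;
        # on a tie, the pattern listed first wins.
        best_kind, best_end = None, pos
        for kind, pattern in PATTERNS:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()

        column = pos - line_start + 1
        if best_kind is None:
            raise SyntaxError(
                f'unexpected character {ch!r} at line {line}, column {column}')

        lexeme = text[pos:best_end]
        if best_kind == 'Identifier' and lexeme in KEYWORDS:
            best_kind = 'Keyword'
        yield Token(best_kind, lexeme, line, column)

        # A block comment may span lines, so keep the line counter in sync.
        newlines = lexeme.count('\n')
        if newlines:
            line += newlines
            line_start = pos + lexeme.rfind('\n') + 1
        pos = best_end  # advance the cursor past the match


for tok in tokenize('int main()'):
    print(tok)
```

With this sketch, 'int main()' yields four tokens (a keyword, an identifier, and two brackets) without relying on spaces between them, and each token carries its position for error messages.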

1 Accepted 2021-01-08 20:18:17
