Extracting all variables from a string of Python code (regex or AST)

Question

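I want to find and extract all the variables in a string that contains Python code. I only want to extract variables (and variables with subscripts), not function calls.

For example, from the following string: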

code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'

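I want to extract: foo, bar[1], baz[1:10:var1[2+1]], var1[2+1], qux[[1,2,int(var2)]], var2, bob[len("foobar")], var3[0]. Note that some variables may be "nested". For example, from baz[1:10:var1[2+1]] I want to extract both baz[1:10:var1[2+1]] and var1[2+1].

The first two ideas that came to mind were to use a regex or an AST. I have tried both, without success.

With a regex, to keep things simpler, I thought it would be a good idea to first extract the "top-level" variables and then recursively extract the nested ones. Unfortunately, I can't even do that.

This is what I have so far: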

import re

regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'
for match in re.finditer(regex, code):
    print(match)

Here is a demo: https://regex101.com/r/INPRdN/2
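To illustrate why this attempt falls short, here is a small sketch that runs the same pattern on the example string. The greedy `.*` inside `\[.*\]` runs on to the last `]` in the whole string, so everything from `bar[1]` up to `var3[0]` comes back as one match. The `matches` name is introduced here only for illustration.

```python
import re

code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'

# Collect the matched text instead of printing Match objects.
matches = [m.group(0) for m in re.finditer(regex, code)]
for text in matches:
    print(repr(text))

# Only two matches are found: 'foo ' and one long match that
# starts at 'bar[1]' and ends at 'var3[0]'.
print(len(matches))
```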

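Another solution would be to use an AST, extending ast.NodeVisitor and implementing the visit_Name and visit_Subscript methods. However, this doesn't work either, because visit_Name is also called for function names.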
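A minimal sketch of the visitor described above may make the problem concrete. This is a guess at what such an attempt looks like, not the original code: `NaiveVisitor` and `source` are hypothetical names, and `ast.get_source_segment` assumes Python 3.8 or later.

```python
import ast

class NaiveVisitor(ast.NodeVisitor):
    """Records every Name and Subscript node, with no special handling of calls."""

    def __init__(self, source):
        self.source = source
        self.names = []

    def visit_Name(self, node):
        self.names.append(node.id)

    def visit_Subscript(self, node):
        # Record the subscript as written in the source, then keep descending.
        self.names.append(ast.get_source_segment(self.source, node))
        self.generic_visit(node)

source = 'foo + func() + func2(var3[0])'
visitor = NaiveVisitor(source)
visitor.visit(ast.parse(source))

# The function names 'func' and 'func2' are reported as if they were variables,
# and 'var3' is reported a second time on its own.
print(visitor.names)
```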

I would really appreciate it if someone could provide a solution to this problem (regex or AST).

Thank you.

Answer 1

This answer might be too late, but it is possible to do this with the third-party Python regex package.

import regex

code = '''foo + bar[1] + baz[1:10:var1[2+1]] + 
qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 
(var3[0])'''
p = r'(\b[a-z]\w*\b(?!\s*[\(\"])(\[(?:[^\[\]]|(?2))*\])?)'
# overlapped=True is needed to capture matches nested inside another match, such as 'var1[2+1]'
result = regex.findall(p, code, overlapped=True)
# each item in result is a tuple of two groups (see the explanation below); keep only the first group
print([x[0] for x in result])

Output:
['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']

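Pattern explanation:

( # start of the 1st capture group
\b[a-z]\w*\b # variable name, e.g. 'bar'
(?!\s*[\(\"]) # negative lookahead, so things like foobar" (or a name followed by an opening parenthesis) are ignored
(\[(?:[^\[\]]|(?2))*\]) # 2nd capture group, which captures the nested groups inside '[ ]',
# e.g. '[1:10:var1[2+1]]'.
# '(?2)' recursively refers to the 2nd capture group
? # the 2nd capture group is optional, so something like 'foo' can also be captured
) # end of the 1st group.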

Answer 2

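Regex is not a powerful enough tool to do this. If your nesting depth is bounded, there are hacky workarounds that let you build a complicated regex to do what you are looking for, but I wouldn't recommend it.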
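As a rough illustration of the bounded-depth workaround mentioned above, the sketch below builds a plain `re` pattern that tolerates a fixed number of bracket levels and then rescans the inside of each match to pick up nested variables. The helper names `bracket_pattern` and `find_vars` and the default depth of 3 are assumptions made for this example. It inherits the usual weaknesses of the regex approach: it does not understand string literals or keywords.

```python
import re

def bracket_pattern(depth):
    """Build a pattern for '[...]' that allows at most `depth` levels of brackets."""
    pattern = r'\[[^\[\]]*\]'  # one level: no brackets inside
    for _ in range(depth - 1):
        # each pass allows one more level of nesting inside the brackets
        pattern = r'\[(?:[^\[\]]|' + pattern + r')*\]'
    return pattern

def find_vars(code, depth=3):
    """Return variables and subscripted variables, outer ones before nested ones."""
    var_re = re.compile(
        r'\b[A-Za-z_]\w*\b(?!\s*[("])(?:' + bracket_pattern(depth) + r')?'
    )
    found = []

    def scan(text):
        for match in var_re.finditer(text):
            expr = match.group(0)
            found.append(expr)
            start = expr.find('[')
            if start != -1:
                # rescan the text between the outer brackets for nested variables
                scan(expr[start + 1:-1])

    scan(code)
    return found

code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
print(find_vars(code))
```

Anything nested deeper than `depth` levels is silently cut short, which is the main reason this approach is fragile.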

This question is asked a lot, and the linked answer is famous for demonstrating how difficult what you are trying to do is.

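If you really must parse a string of code, an AST would technically work, but I am not aware of a library that would help you with this. You are probably best off building a recursive function to do the parsing.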
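For what it is worth, Python's standard-library `ast` module can do the parsing. The sketch below is one possible approach, not code from this answer: it skips the function name of each call while still visiting the arguments, and it reports each subscript as written in the source. `VariableCollector` and `extract_variables` are hypothetical names, and `ast.get_source_segment` assumes Python 3.8 or later.

```python
import ast

class VariableCollector(ast.NodeVisitor):
    """Collects names and subscript expressions, ignoring called function names."""

    def __init__(self, source):
        self.source = source
        self.found = []

    def visit_Name(self, node):
        self.found.append(node.id)

    def visit_Call(self, node):
        # Skip a bare function name, but still look inside the arguments.
        if not isinstance(node.func, ast.Name):
            self.visit(node.func)
        for arg in node.args:
            self.visit(arg)
        for keyword in node.keywords:
            self.visit(keyword.value)

    def visit_Subscript(self, node):
        # Report the whole subscript, then descend into the index for nested variables.
        self.found.append(ast.get_source_segment(self.source, node))
        if not isinstance(node.value, ast.Name):
            self.visit(node.value)
        self.visit(node.slice)

def extract_variables(source):
    collector = VariableCollector(source)
    collector.visit(ast.parse(source))
    return collector.found

code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
print(extract_variables(code))
```

Because the parser handles string literals and arbitrary nesting, this avoids the depth limits of a regex, at the cost of requiring the input to be syntactically valid Python.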

Answer 3

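I found your question an interesting challenge, so here is code that does what you want. Doing this with a regex alone is not possible because of the nested expressions, so this solution combines regexes with string manipulation to handle them: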

# -*- coding: utf-8 -*-
import re
RE_IDENTIFIER = r'\b[a-z]\w*\b(?!\s*[\[\("\'])'
RE_INDEX_ONLY = re.compile(r'(##)(\d+)(##)')
RE_INDEX = re.compile(r'##\d+##')


def extract_expression(string):
    """ extract all identifier and getitem expression in the given order."""

    def remove_brackets(text):
        # 1. handle `[...]` expressions: replace them with #{#...#}#
        # so we don't confuse them with word[...]
        pattern = r'(?<!\w)(\s*)(\[)([^\[]+?)(\])'
        # keep extracting expressions until there are none left
        while re.search(pattern, text):
            # substitute in `text` (not the outer `string`) so every pass makes progress
            text = re.sub(pattern, r'\1#{#\3#}#', text)
        return text

    def get_ordered_subexp(exp):
        """ get index of nested expression."""
        index = int(exp.replace('#', ''))
        subexp = RE_INDEX.findall(expressions[index])
        if not subexp:
            return exp
        return exp + ''.join(get_ordered_subexp(i) for i in subexp)

    def replace_expression(match):
        """ save the expression in the list, replace it with a special key and its index in the list."""
        match_exp = match.group(0)
        current_index = len(expressions)
        expressions.append(None)  # reserve the slot so the expression is inserted before its inner identifiers
        # if the expression contains identifiers, extract them too.
        if re.search(RE_IDENTIFIER, match_exp) and '[' in match_exp:
            match_exp = re.sub(RE_IDENTIFIER, replace_expression, match_exp)
        expressions[current_index] = match_exp
        return '##{}##'.format(current_index)

    def fix_expression(match):
        """ replace the match by the corresponding expression using the index"""
        return expressions[int(match.group(2))]

    # list that will contain the extracted expressions
    expressions = []

    string = remove_brackets(string)

    # 2. extract all expressions and keep track of their place in the original code
    pattern = r'\w+\s*\[[^\[]+?\]|{}'.format(RE_IDENTIFIER)
    # keep extracting expressions until there are none left
    while re.search(pattern, string):
        # every expression that is extracted is replaced by a special key
        string = re.sub(pattern, replace_expression, string)
        # sometimes the inside of brackets contains a getitem expression,
        # so after extracting that expression we handle the brackets again
        string = remove_brackets(string)

    # 3. build the correct result with extracted expressions
    result = [None] * len(expressions)
    for index, exp in enumerate(expressions):
        # keep replacing special keys with the correct expression
        while RE_INDEX_ONLY.search(exp):
            exp = RE_INDEX_ONLY.sub(fix_expression, exp)
        # finally we don't forget about the brackets
        result[index] = exp.replace('#{#', '[').replace('#}#', ']')

    # 4. order the indexes that were extracted
    ordered_index = ''.join(get_ordered_subexp(exp) for exp in RE_INDEX.findall(string))
    # convert it to integer
    ordered_index = [int(index[1]) for index in RE_INDEX_ONLY.findall(ordered_index)]

    # 5. fix the order of expressions using the ordered indexes
    final_result = []
    for exp_index in ordered_index:
        final_result.append(result[exp_index])

    # for debug:
    # print('final string:', string)
    # print('expression :', expressions)
    # print('order_of_expresion: ', ordered_index)
    return final_result


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
code2 = 'baz[1:10:var1[2+1]]'
code3 = 'baz[[1]:10:var1[2+1]:[var3[3+1*x]]]'
print(extract_expression(code))
print(extract_expression(code2))
print(extract_expression(code3))

Output:

['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']
['baz[1:10:var1[2+1]]', 'var1[2+1]']
['baz[[1]:10:var1[2+1]:[var3[3+1*x]]]', 'var1[2+1]', 'var3[3+1*x]', 'x']

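I tested this code against very complicated examples and it worked fine. Note that the order of extraction is the same as the one you wanted. I hope this is what you need.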

Solution 1
1 2021-10-31 16:25:14

Solution 2
0 2019-10-04 13:32:07

Solution 3
0 2019-10-04 16:21:23
