Extract all variables from a string of Python code (regex or AST)
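I want to find and extract all variables in a string that contains Python code. I only want to extract variables (and subscripted variables), not function calls.
For example, from the following string: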
code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
I want to extract: foo, bar[1], baz[1:10:var1[2+1]], var1[2+1], qux[[1,2,int(var2)]], var2, bob[len("foobar")], var3[0]. Note that some variables may be "nested". For example, from baz[1:10:var1[2+1]] I want to extract both baz[1:10:var1[2+1]] and var1[2+1].
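The first two ideas that came to mind were a regex or an AST. I tried both, without success.
With a regex, to keep things simple, I thought it would be a good idea to extract the "top-level" variables first and then extract the nested ones recursively. Unfortunately, I cannot even manage that.
Here is what I have so far: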
import re

regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'
for match in re.finditer(regex, code):
    print(match)
Here is a demo: https://regex101.com/r/INPRdN/2
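A quick way to see why this pattern falls short (a diagnostic sketch, not part of the original attempt; the `matches` name is only illustrative): the greedy `.*` inside `\[.*\]` runs to the last `]` in the string, so everything from `bar[1]` onward collapses into one match.

```python
import re

code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'

# Collect the full text of every match instead of printing match objects.
matches = [m.group(0) for m in re.finditer(regex, code)]

print(len(matches))   # only two matches are found
print(repr(matches[0]))  # 'foo ' (the \s* also swallows the trailing space)
print(repr(matches[1]))  # starts at 'bar[1]' and ends at the last ']' in the string
```

Making the quantifier lazy (`.*?`) would not help either, because it would then stop at the first `]` and cut nested subscripts short.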
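The other option would be to use an AST, extending ast.NodeVisitor and implementing the visit_Name and visit_Subscript methods. However, this does not work either, because visit_Name is also called for function names.
I would be grateful if someone could provide a solution to this problem (regex or AST).
Thank you.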
This answer may be too late, but it can be done with the third-party Python regex package.
import regex

code = '''foo + bar[1] + baz[1:10:var1[2+1]] +
qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2
(var3[0])'''
p = r'(\b[a-z]\w*\b(?!\s*[\(\"])(\[(?:[^\[\]]|(?2))*\])?)'
# overlapped=True is needed to capture a match nested inside another match, such as 'var1[2+1]'
result = regex.findall(p, code, overlapped=True)
# result is a list of 2-tuples because the pattern has two capture groups; keep the first of each
print([x[0] for x in result])
Output:
['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']
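Pattern explanation:
`\b[a-z]\w*\b` matches an identifier that starts with a lowercase letter. `(?!\s*[\(\"])` rejects it when the next non-space character is `(` or `"`, which filters out function calls and the text at the start of a double-quoted string. `(\[(?:[^\[\]]|(?2))*\])?` optionally matches a balanced `[...]` suffix: `(?2)` re-enters group 2 recursively, a feature of the regex package that the standard re module does not offer.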
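To make the two ideas in the pattern concrete (skip names followed by `(` or `"`, then attach a balanced `[...]` suffix), here is a rough standard-library sketch of the same approach. It is not the answer's code: `IDENT`, `closing_bracket` and `find_variables` are hypothetical names, and the bracket counter assumes no `[` or `]` appears inside a string literal.

```python
import re

# Same identifier rule as the pattern above: a lowercase-initial name
# that is not followed by '(' or '"'.
IDENT = re.compile(r'\b[a-z]\w*\b(?!\s*[\("])')


def closing_bracket(text, start):
    """Return the index just past the ']' that balances text[start] == '[', or -1."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == '[':
            depth += 1
        elif text[i] == ']':
            depth -= 1
            if depth == 0:
                return i + 1
    return -1


def find_variables(text):
    """List names and subscripted names, including those nested inside subscripts."""
    found = []
    for m in IDENT.finditer(text):
        end = m.end()
        # Attach the balanced subscript only when '[' directly follows the name.
        if text.startswith('[', end):
            close = closing_bracket(text, end)
            if close != -1:
                end = close
        found.append(text[m.start():end])
    return found


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
print(find_variables(code))
```

Because the scan only consumes the identifier itself and looks ahead for the subscript, names nested inside an earlier subscript are still visited, which plays the role of `overlapped=True`.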
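Regular expressions are not a powerful enough tool for this. If your nesting depth is bounded, there are hacky workarounds that let you build a complicated regex that does what you are after, but I would not recommend it.
This question is asked a lot, and the linked answer is famous for demonstrating how difficult what you are trying to do is.
If you really must parse code out of a string, an AST would technically work, but I am not aware of a library that would help you with this. You would be better off building a recursive function to do the parsing.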
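For what it is worth, the standard-library ast module appears to be enough here with a small visitor. The following is a minimal sketch, assuming Python 3.8+ (for `ast.get_source_segment`) and input that parses as a single expression; `VariableCollector` and `extract_variables` are hypothetical names. Overriding `visit_Call` so that a bare callee name is skipped addresses the problem of `visit_Name` firing for function names.

```python
import ast


class VariableCollector(ast.NodeVisitor):
    """Collect plain names and subscript expressions, skipping called function names."""

    def __init__(self, source):
        self.source = source
        self.found = []

    def visit_Call(self, node):
        # Skip a bare callee such as `func` in `func()`, but still walk the arguments.
        if not isinstance(node.func, ast.Name):
            self.visit(node.func)
        for arg in node.args:
            self.visit(arg)
        for keyword in node.keywords:
            self.visit(keyword.value)

    def visit_Subscript(self, node):
        # Record the exact source text of the whole subscript expression.
        self.found.append(ast.get_source_segment(self.source, node))
        # Do not report the base name on its own (`bar` in `bar[1]`).
        if not isinstance(node.value, ast.Name):
            self.visit(node.value)
        # Descend into the index or slice to pick up nested variables.
        self.visit(node.slice)

    def visit_Name(self, node):
        self.found.append(node.id)


def extract_variables(source):
    collector = VariableCollector(source)
    collector.visit(ast.parse(source, mode='eval'))
    return collector.found


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
print(extract_variables(code))
```

Unlike the regex approaches, this only works on syntactically valid Python, and it ignores brackets inside string literals for free because the parser has already tokenized them. Attribute access and other node types are left to the default traversal and would need their own handling.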
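I found your question an interesting challenge, so here is code that does what you want. It cannot be done with a regex alone because of the nested expressions, so this solution combines regexes with string manipulation to handle them: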
# -*- coding: utf-8 -*-
import re

RE_IDENTIFIER = r'\b[a-z]\w*\b(?!\s*[\[\("\'])'
RE_INDEX_ONLY = re.compile(r'(##)(\d+)(##)')
RE_INDEX = re.compile(r'##\d+##')


def extract_expression(string):
    """ extract all identifier and getitem expressions in the given order."""

    def remove_brackets(text):
        # 1. handle `[...]` expressions: replace them with #{#...#}#
        #    so we don't confuse them with word[...]
        pattern = r'(?<!\w)(\s*)(\[)([^\[]+?)(\])'
        # keep extracting expressions until there are none left
        while re.search(pattern, text):
            # substitute on `text`, not the outer `string`, so each pass makes progress
            text = re.sub(pattern, r'\1#{#\3#}#', text)
        return text

    def get_ordered_subexp(exp):
        """ get the indexes of nested expressions."""
        index = int(exp.replace('#', ''))
        subexp = RE_INDEX.findall(expressions[index])
        if not subexp:
            return exp
        return exp + ''.join(get_ordered_subexp(i) for i in subexp)

    def replace_expression(match):
        """ save the expression in the list, replace it with a special key and its index in the list."""
        match_exp = match.group(0)
        current_index = len(expressions)
        expressions.append(None)  # just to make sure the expression is inserted before its inner identifiers
        # if the expression contains identifiers, extract them too.
        if re.search(RE_IDENTIFIER, match_exp) and '[' in match_exp:
            match_exp = re.sub(RE_IDENTIFIER, replace_expression, match_exp)
        expressions[current_index] = match_exp
        return '##{}##'.format(current_index)

    def fix_expression(match):
        """ replace the match by the corresponding expression using the index"""
        return expressions[int(match.group(2))]

    # list that will contain the extracted expressions
    expressions = []

    string = remove_brackets(string)

    # 2. extract all expressions and keep track of their place in the original code
    pattern = r'\w+\s*\[[^\[]+?\]|{}'.format(RE_IDENTIFIER)
    # keep extracting expressions until there are none left
    while re.search(pattern, string):
        # every expression that is extracted is replaced by a special key
        string = re.sub(pattern, replace_expression, string)
        # sometimes the content of brackets contains a getitem expression,
        # so after extracting that expression we handle the brackets again
        string = remove_brackets(string)

    # 3. build the correct result from the extracted expressions
    result = [None] * len(expressions)
    for index, exp in enumerate(expressions):
        # keep replacing special keys with the correct expression
        while RE_INDEX_ONLY.search(exp):
            exp = RE_INDEX_ONLY.sub(fix_expression, exp)
        # finally, don't forget to restore the brackets
        result[index] = exp.replace('#{#', '[').replace('#}#', ']')

    # 4. order the indexes that were extracted
    ordered_index = ''.join(get_ordered_subexp(exp) for exp in RE_INDEX.findall(string))
    # convert them to integers
    ordered_index = [int(index[1]) for index in RE_INDEX_ONLY.findall(ordered_index)]

    # 5. fix the order of the expressions using the ordered indexes
    final_result = []
    for exp_index in ordered_index:
        final_result.append(result[exp_index])

    # for debug:
    # print('final string:', string)
    # print('expression :', expressions)
    # print('order_of_expression: ', ordered_index)
    return final_result


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
code2 = 'baz[1:10:var1[2+1]]'
code3 = 'baz[[1]:10:var1[2+1]:[var3[3+1*x]]]'
print(extract_expression(code))
print(extract_expression(code2))
print(extract_expression(code3))
Output:
['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']
['baz[1:10:var1[2+1]]', 'var1[2+1]']
['baz[[1]:10:var1[2+1]:[var3[3+1*x]]]', 'var1[2+1]', 'var3[3+1*x]', 'x']
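I tested this code against very complex examples and it works well. Note that the expressions are extracted in the same order you asked for. I hope this is what you need.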