Extract all variables from a string of Python code (regex or AST)

I want to find and extract all the variables in a string that contains Python code. I only want to extract the variables (and variables with subscripts), not function calls.

For example, from the following string:

code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'

I want to extract: foo, bar[1], baz[1:10:var1[2+1]], var1[2+1], qux[[1,2,int(var2)]], var2, bob[len("foobar")], var3[0]. Please note that some variables may be "nested". For example, from baz[1:10:var1[2+1]] I want to extract both baz[1:10:var1[2+1]] and var1[2+1].

The first two ideas that come to mind are to use either a regex or an AST. I have tried both, but with no success.

When using a regex, I thought it would be simpler to first extract the "top level" variables and then recursively extract the nested ones. Unfortunately, I can't even do that.

This is what I have so far:

import re

regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'
for match in re.finditer(regex, code):
    print(match)

Here is a demo: https://regex101.com/r/INPRdN/2

The other solution is to use an AST: extend ast.NodeVisitor and implement the visit_Name and visit_Subscript methods. However, this doesn't work either, because visit_Name is also called for function names.
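The visit_Name behaviour can be reproduced with a minimal sketch. The NameCollector class and the sample expression are illustrative names chosen here, not part of the original attempt. In the tree that ast.parse builds, the callee of a call is itself a Name node, so a visitor that only implements visit_Name cannot tell func apart from foo:

```python
import ast


class NameCollector(ast.NodeVisitor):
    """Hypothetical visitor that records every Name node it encounters."""

    def __init__(self):
        self.names = []

    def visit_Name(self, node):
        # Called for variables AND for the callee of a function call.
        self.names.append(node.id)


collector = NameCollector()
collector.visit(ast.parse("func(var3[0]) + foo", mode="eval"))
print(collector.names)  # ['func', 'var3', 'foo']
```

The function name shows up in the result because nothing in the visitor treats ast.Call nodes specially.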

I would appreciate it if someone could provide a solution (regex or AST) to this problem.

Thank you.

This answer might be too late, but it is possible to do this using the Python regex package (the third-party regex module, not the built-in re).

import regex  # third-party package (pip install regex), not the built-in re

code = '''foo + bar[1] + baz[1:10:var1[2+1]] + 
qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 
(var3[0])'''
p = r'(\b[a-z]\w*\b(?!\s*[\(\"])(\[(?:[^\[\]]|(?2))*\])?)'
# overlapped=True is needed to capture something nested inside another match, like 'var1[2+1]'
result = regex.findall(p, code, overlapped=True)
# result is a list of 2-tuples because the pattern has two capturing groups (see below);
# the first element of each tuple is the full match
print([x[0] for x in result])

Output:
['foo','bar[1]','baz[1:10:var1[2+1]]','var1[2+1]','qux[[1,2,int(var2)]]','var2','bob[len("foobar")]','var3[0]']

Pattern explanation:

  • ( # start of the 1st capturing group
  • \b[a-z]\w*\b # variable name, e.g. 'bar'
  • (?!\s*[\(\"]) # negative lookahead, to ignore function calls like func( and string contents like foobar"
  • (\[(?:[^\[\]]|(?2))*\]) # 2nd capturing group: captures nested groups in '[ ]',
    # e.g. '[1:10:var1[2+1]]'.
    # '(?2)' refers to the 2nd capturing group recursively
  • ? # the 2nd capturing group is optional, so that something like 'foo' is also captured
  • ) # end of the 1st group

Regex is not a powerful enough tool to do this. If your nesting has a finite depth, there are some hacky workarounds that would let you build a complicated regex to do what you are looking for, but I would not recommend it.
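To make the finite-depth workaround concrete, here is a minimal sketch using only the standard re module. The helper names bracket_pattern and top_level_subscripts and the max_depth parameter are assumptions made for illustration. The pattern is built by wrapping a flat bracket matcher in itself once per allowed nesting level, which is why it only works up to a fixed depth. It finds the top-level subscripted names only, and it would be fooled by brackets inside string literals:

```python
import re


def bracket_pattern(max_depth):
    """Build a regex fragment matching [...] with at most max_depth nested levels."""
    # Depth 0: a bracket pair that contains no further brackets.
    pattern = r'\[[^\[\]]*\]'
    for _ in range(max_depth):
        # One more level: non-bracket characters or a shallower bracket pair.
        pattern = r'\[(?:[^\[\]]|' + pattern + r')*\]'
    return pattern


def top_level_subscripts(code, max_depth=3):
    """Return each identifier followed by a balanced [...] (non-overlapping matches)."""
    regex = re.compile(r'\b[A-Za-z_]\w*\s*(?:' + bracket_pattern(max_depth) + r')')
    return [m.group(0) for m in regex.finditer(code)]


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
print(top_level_subscripts(code))
# ['bar[1]', 'baz[1:10:var1[2+1]]', 'qux[[1,2,int(var2)]]', 'bob[len("foobar")]', 'var3[0]']
```

Nested items such as var1[2+1] would still need a second pass over the text inside each match, and any input nested deeper than max_depth is matched only partially or not at all. That limitation is what makes this approach fragile.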

This question is asked a lot, and the linked response is famous for demonstrating the difficulty of what you are trying to do.

If you really must parse a string of code, an AST would technically work, but I am not aware of a library to help with this. You would be best off trying to build a recursive function to do the parsing.
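As a rough sketch of the AST route, the standard-library ast module can be combined with a small recursive visitor. The names VariableExtractor and extract_variables are hypothetical, and the code assumes Python 3.8+ (for ast.get_source_segment) and that the input is a single expression. The visitor overrides visit_Call so that a bare function name is skipped while its arguments are still visited, and overrides visit_Subscript so that the full subscript text is recorded once instead of the base name:

```python
import ast


class VariableExtractor(ast.NodeVisitor):
    """Collect plain names and subscript expressions, skipping called function names."""

    def __init__(self, source):
        self.source = source
        self.found = []

    def visit_Name(self, node):
        self.found.append(node.id)

    def visit_Subscript(self, node):
        # Record the whole subscript, e.g. 'baz[1:10:var1[2+1]]'.
        self.found.append(ast.get_source_segment(self.source, node))
        # Do not report the bare base name again; only descend into complex bases.
        if not isinstance(node.value, ast.Name):
            self.visit(node.value)
        # Look for nested variables inside the index or slice.
        self.visit(node.slice)

    def visit_Call(self, node):
        # Skip a bare function name such as 'len', but still visit the arguments.
        if not isinstance(node.func, ast.Name):
            self.visit(node.func)
        for arg in node.args:
            self.visit(arg)
        for keyword in node.keywords:
            self.visit(keyword.value)


def extract_variables(source):
    """Return variables and subscripted variables in source order."""
    extractor = VariableExtractor(source)
    extractor.visit(ast.parse(source, mode="eval"))
    return extractor.found


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
print(extract_variables(code))
```

Because the parser handles nesting and string literals itself, this sketch needs no special cases for inputs like bob[len("foobar")]. Method calls and attribute access (for example obj.method(x)) are not given any special treatment here and may need their own rules.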

I find your question an interesting challenge, so here is code that does what you want. Doing this with a single standard-library regex is not possible because the expressions can be nested, so this solution combines regexes with string manipulation to handle nested expressions:

# -*- coding: utf-8 -*-
import re
RE_IDENTIFIER = r'\b[a-z]\w*\b(?!\s*[\[\("\'])'
RE_INDEX_ONLY = re.compile(r'(##)(\d+)(##)')
RE_INDEX = re.compile(r'##\d+##')


def extract_expression(string):
    """ extract all identifier and getitem expression in the given order."""

    def remove_brackets(text):
        # 1. handle bare `[...]` expressions: replace them with #{#...#}#
        # so we don't confuse them with word[...]
        pattern = r'(?<!\w)(\s*)(\[)([^\[]+?)(\])'
        # keep replacing until no bare bracket expression is left
        while re.search(pattern, text):
            # substitute in `text` (not the outer `string`) so each pass makes progress
            text = re.sub(pattern, r'\1#{#\3#}#', text)
        return text

    def get_ordered_subexp(exp):
        """ get index of nested expression."""
        index = int(exp.replace('#', ''))
        subexp = RE_INDEX.findall(expressions[index])
        if not subexp:
            return exp
        return exp + ''.join(get_ordered_subexp(i) for i in subexp)

    def replace_expression(match):
        """ save the expression in the list, replace it with special key and it's index in the list."""
        match_exp = match.group(0)
        current_index = len(expressions)
        expressions.append(None)  # just to make sure the expression is inserted before it's inner identifier
        # if the expression contains identifier extract too.
        if re.search(RE_IDENTIFIER, match_exp) and '[' in match_exp:
            match_exp = re.sub(RE_IDENTIFIER, replace_expression, match_exp)
        expressions[current_index] = match_exp
        return '##{}##'.format(current_index)

    def fix_expression(match):
        """ replace the match by the corresponding expression using the index"""
        return expressions[int(match.group(2))]

    # list that will hold the extracted expressions
    expressions = []

    string = remove_brackets(string)

    # 2. extract all expressions and keep track of their place in the original code
    pattern = r'\w+\s*\[[^\[]+?\]|{}'.format(RE_IDENTIFIER)
    # keep extracting expressions until none are left
    while re.search(pattern, string):
        # every expression that is extracted is replaced by a special key
        string = re.sub(pattern, replace_expression, string)
        # sometimes the brackets themselves contain a getitem expression,
        # so after extracting that expression we handle the brackets again
        string = remove_brackets(string)

    # 3. build the correct result with extracted expressions
    result = [None] * len(expressions)
    for index, exp in enumerate(expressions):
        # keep replacing special keys with the correct expression
        while RE_INDEX_ONLY.search(exp):
            exp = RE_INDEX_ONLY.sub(fix_expression, exp)
        # finally we don't forget about the brackets
        result[index] = exp.replace('#{#', '[').replace('#}#', ']')

    # 4. Order the indexes that were extracted
    ordered_index = ''.join(get_ordered_subexp(exp) for exp in RE_INDEX.findall(string))
    # convert it to integer
    ordered_index = [int(index[1]) for index in RE_INDEX_ONLY.findall(ordered_index)]

    # 5. fix the order of expressions using the ordered indexes
    final_result = []
    for exp_index in ordered_index:
        final_result.append(result[exp_index])

    # for debug:
    # print('final string:', string)
    # print('expression :', expressions)
    # print('order_of_expresion: ', ordered_index)
    return final_result


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
code2 = 'baz[1:10:var1[2+1]]'
code3 = 'baz[[1]:10:var1[2+1]:[var3[3+1*x]]]'
print(extract_expression(code))
print(extract_expression(code2))
print(extract_expression(code3))

Output:

['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']
['baz[1:10:var1[2+1]]', 'var1[2+1]']
['baz[[1]:10:var1[2+1]:[var3[3+1*x]]]', 'var1[2+1]', 'var3[3+1*x]', 'x']

I tested this code on very complicated examples and it worked perfectly. Notice that the order of extraction is the same as you wanted. I hope this is what you needed.
