Parsing a string containing code into a list / tree in Python

As the title suggests, I'm trying to parse a piece of code into a tree or a list. First off, thank you for any contribution and time spent on this. So far my code does what I expect, but I am not sure that it is the optimal or most generic way to do it.


1. I want a more generic solution, since in the future I will need to analyse this syntax further.
2. Right now I am unable to separate operators like '=' or '>=', as you can see in the output I share below. In the future I might change the content of the list / tree from strings to tuples so I can identify the kind of each item (parameter, comparison such as = or >=, ...). But this is not a real need right now.


My first attempt was parsing the text character by character, but my code was getting too messy and barely readable, so I assumed I was doing something wrong there (I don't have that code to share anymore). So I started looking at how other people were doing it and found some approaches that didn't necessarily fulfil the requirements of being simple and generic. I would share the links to those sites, but I didn't keep track of them.

The syntax of the code

The syntax is pretty simple; after all, I'm not interested in types or any further detail, just the functions and parameters. Strings are defined as 'my string', variables as !variable, and numbers as in any other language. Here is a sample of code:
 db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)

My output

Here my output is only partially correct, since I'm still unable to separate the "= '3'" part (it has to be separated because in this case the = is a comparison operator and not part of a string).

 [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "= '3'", "'4'", "'5'"]}, '6']}]

Desired output

 [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "=", "'3'", "'4'", "'5'"]}, '6']}]

My code so far

The parseRecursive method is the entry point.

import re

class FileParser:

    # order is important to avoid mis-splits
    COMPARATOR_SIGN = {
        '@=', '@<>', '<>', '>=', '<=', '=', '>', '<'
    }

    def __init__(self):
        pass

    def __charExistsInOccurences(self, current_needle, needles, text):
        """
        check if other needles are present in text
        current_needle: string -> the current needle being evaluated
        needles: list -> list of needles
        text: string/list<string> -> a string or a list of strings to evaluate
        """
        # if text is a string convert it to a list of strings
        text = text if isinstance(text, list) else [text]
        exists = False
        for t in text:
            # check if needle is inside text value
            for needle in needles:
                # don't check the same key
                if needle != current_needle:
                    regex_search_needle = split_regex = r'\s*' + r'\s*'.join(needle) + r'\s*'
                    # list of 1's and 0's. 1 if another character is found in the string.
                    found = [1 if re.search(regex_search_needle, x) else 0 for x in t]
                    if sum(found) > 0:
                        exists = True
                        break
        return exists

    def findOperator(self, needles, haystack):
        """
        split parameters from operators
        needles: list -> list of operators
        haystack: string
        """
        string_open = haystack.find("'")
        # if no string has been found set the index to 0
        if string_open < 0:
            string_open = 0
        occurences = []
        string_closure = haystack.rfind("'")
        operator = ''
        for needle in needles:
            # regex to ignore the possible spaces between characters of the needle
            split_regex = r'\s*' + r'\s*'.join(needle) + r'\s*'
            # parse parameters before and after the string
            before_string = re.split(split_regex, haystack[0:string_open])
            after_string = re.split(split_regex, haystack[string_closure+1:])
            # check if any other needle exists in the results found
            before_string_exists = self.__charExistsInOccurences(needle, needles, before_string)
            after_string_exists = self.__charExistsInOccurences(needle, needles, after_string)
            # if the operator has been found merge the results with the occurences and assign the operator
            if not before_string_exists and not after_string_exists:
                occurences.extend(before_string)
                occurences.extend([haystack[string_open:string_closure+1]])
                occurences.extend(after_string)
                operator = needle
        # filter blank spaces generated
        occurences = list(filter(lambda x: len(x.strip()) > 0, occurences))
        result_check = [1 if x == haystack else 0 for x in occurences]
        # if the haystack was originally a simple string like '1' the occurences list is going to be
        # filled with the same value over and over due to the before-string and after-string parts
        if len(result_check) == sum(result_check):
            occurences = [haystack]
            operator = ''
        return operator, occurences

    def parseRecursive(self, text):
        """
        parse a block of text
        text: string
        """
        # note: asserting a parenthesized tuple is always true, so this check never fires
        assert(len(text) < 1, "text is empty")
        function_open = text.find('(')
        accumulated_params = []
        if function_open > -1:
            # there is another function nested
            text_prev_function = text[0:function_open]
            # find last space, comma or equal sign to retrieve the function name
            last_space = -1
            for j in range(len(text_prev_function)-1, 0, -1):
                if text_prev_function[j] == ' ' or text_prev_function[j] == ',' or text_prev_function[j] == '=':
                    last_space = j
                    break
            func_name = ''
            if last_space > -1:
                # there is something else behind the function name
                func_name = text_prev_function[last_space+1:]
                # no parenthesis before, so previous characters from function name are parameters
                text_prev_func_params = list(filter(lambda x: len(x.strip()) > 0, text_prev_function[:last_space+1].split(',')))
                text_prev_func_params = [x.strip() for x in text_prev_func_params]
                # debug here
                # accumulated_params.extend(text_prev_func_params)
                for itext_prev in text_prev_func_params:
                    operator, text_prev_operator = self.findOperator(self.COMPARATOR_SIGN, itext_prev)
                    if operator == '':
                        accumulated_params.extend(text_prev_operator)
                    else:
                        text_prev_operator.append(operator)
                        accumulated_params.extend(text_prev_operator)
                        # accumulated_params.extend(text_prev_operator)
            else:
                # function name is the start of the string
                func_name = text_prev_function[0:].strip()
            # find the closure of parenthesis
            function_close = text.rfind(')')
            # parse the next function and extend the current list of parameters
            next_func = text[function_open+1:function_close]
            func_params = {func_name: self.parseRecursive(next_func)}
            accumulated_params.append(func_params)
            #
            # parameters after the function
            #
            new_text = text[function_close+1:]
            accumulated_params.extend(self.parseRecursive(new_text))
        else:
            # there is no other function nested
            split_text = text.split(',')
            current_func_params = list(filter(lambda x: len(x.strip()) > 0, split_text))
            current_func_params = [x.strip() for x in current_func_params]
            accumulated_params.extend(current_func_params)
        # accumulated_params = list(filter(lambda x: len(x.strip()) > 0, accumulated_params))
        return accumulated_params

text = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
obj = FileParser()
print(obj.parseRecursive(text))

You can use pyparsing to deal with such a case.
* pyparsing can be installed with pip install pyparsing


import pyparsing as pp

# A parsing pattern
w = pp.Regex(r'(?:![^(),]+)|[^(), ]+') ^ pp.Suppress(',')
pattern = w + pp.nested_expr('(', ')', content=w)

# A recursive function to transform a pyparsing result into your desirable format
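def transform(elements):
    stack = []
    for e in elements:
        if isinstance(e, list):
            # A nested list holds the arguments of the name pushed just before it
            key = stack.pop()
            stack.append({key: transform(e)})
        else:
            # Plain tokens (names, strings, operators, numbers) are kept as-is
            stack.append(e)
    return stack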

# A sample
string = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"

# Operations to parse the sample string
elements = pattern.parse_string(string).as_list()
result = transform(elements)

# Assertion
assert result == [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]

# Show the result
print(result)

Output:

[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]
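The question mentions possibly turning the strings into tuples later so that the kind of each item can be identified. A minimal sketch of that idea, working on the nested result shown above, could look like the following. The names classify and tag, the kind labels, and the number pattern are assumptions made for illustration; only the operator list comes from the question's COMPARATOR_SIGN.

```python
import re

# Comparison operators taken from the question's COMPARATOR_SIGN set.
COMPARATORS = {'@=', '@<>', '<>', '>=', '<=', '=', '>', '<'}
# Assumed number format: optional sign, digits, optional decimal part.
NUMBER = re.compile(r'-?\d+(?:\.\d+)?')

def classify(token):
    """Return a (kind, value) tuple for a single leaf token (hypothetical helper)."""
    if token in COMPARATORS:
        return ('comparison', token)
    if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
        return ('string', token[1:-1])   # strip the surrounding quotes
    if token.startswith('!'):
        return ('variable', token[1:])   # strip the leading '!'
    if NUMBER.fullmatch(token):
        return ('number', token)
    return ('unknown', token)

def tag(tree):
    """Recursively replace leaf strings in the nested list/dict result with tuples."""
    tagged = []
    for node in tree:
        if isinstance(node, dict):
            # A dict maps a function name to its list of arguments
            tagged.append({name: tag(args) for name, args in node.items()})
        else:
            tagged.append(classify(node))
    return tagged

result = [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]
tagged = tag(result)
print(tagged)
```

Keeping the classification in a separate pass means the parsing pattern stays unchanged, and new kinds can be added in one place.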


  • If there is an unbalanced parenthesis inside () (for example a(b(c) or a(b)c)), an unexpected result is obtained or an IndexError is raised, so be careful in such cases.
  • At the moment, only a single sample is available for building the parsing pattern, so if you encounter a parsing error, provide more examples in your question.
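One way to guard against the unbalanced-parenthesis case is to check the input before handing it to the parser. The sketch below is only an illustration: parens_balanced is a hypothetical helper, and it assumes strings are delimited by single quotes with no escaped quotes inside them.

```python
def parens_balanced(text):
    """Return True if parentheses outside quoted strings are balanced
    and every single-quoted string is closed (hypothetical helper)."""
    depth = 0
    in_string = False
    for ch in text:
        if ch == "'":
            # Assumption: no escaped quotes, so each quote toggles the state
            in_string = not in_string
        elif not in_string:
            if ch == '(':
                depth += 1
            elif ch == ')':
                depth -= 1
                if depth < 0:
                    # A closing parenthesis appeared before its opener
                    return False
    return depth == 0 and not in_string

sample = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
for candidate in (sample, "a(b(c)", "a(b)c)"):
    print(candidate, parens_balanced(candidate))
```

Calling this first and rejecting any input for which it returns False avoids reaching the parser with the malformed cases mentioned above.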

