Python and RegEx: Repeating Pattern Not Working
I have a text file written in a proprietary programming language, and I want to extract the relevant information about various function calls.
The structure of the function is:
function name(input1, input2) returns (output1, output2) function body
I'm using Python and RegEx to capture this information, but I've hit a snag. I can capture the name, the inputs and the outputs, but I am unable to grab all of the function body.
I use the following line to capture this info:
re.findall("(function)(.*?)\((.*?)\) returns \((.*?)\)(.*)", file_contents)
However, after the first instance of the word 'function', this fails. Due to nested statements in the function body, I am unable to use a particular keyword to grab the last group (this would be the function body). I've tried different approaches, and I cannot fully grab the entire body.
How can I group everything after a particular point and then repeat the pattern?
What I want: 'function', 'name', 'input1, input2', 'output1, output2', 'function body' to repeat indefinitely. I want the last group to grab everything after the outputs and then the pattern to restart when it gets to the next occurrence of the word 'function'. I've tried different variations of the (.*?) and (.*) quantifiers, but I can't seem to get it.
I am not a programmer by trade, so I am not that adept with RegEx or Python. I know just enough to do the very basics.
This will grab each function up until the next function.
There are 5 capture groups.
If using findall, the results come back in sets of 5 groups (in Python, one 5-tuple per match).
(?s)(\bfunction\b)(.*?)\((.*?)\)\s+returns\s+\((.*?)\)((?:(?!\bfunction\b).)*)
https://regex101.com/r/PkfofA/1
Expanded
(?s)
( \b function \b ) # (1)
( .*? ) # (2)
\(
( .*? ) # (3)
\) \s+ returns \s+ \(
( .*? ) # (4)
\)
( # (5 start)
(?:
(?! \b function \b )
.
)*
) # (5 end)
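As a rough sketch of how the expanded form could be used directly in Python: re.VERBOSE lets the pattern keep its whitespace and comments, and re.DOTALL stands in for the inline (?s). The sample text and the names FUNC_RE and sample are made up for illustration and are not from the original question.

```python
import re

# Expanded pattern from above. re.VERBOSE ignores the layout whitespace and
# the # comments; re.DOTALL replaces the inline (?s) flag.
FUNC_RE = re.compile(r"""
    ( \b function \b )            # (1) keyword
    ( .*? )                       # (2) name
    \(
    ( .*? )                       # (3) inputs
    \) \s+ returns \s+ \(
    ( .*? )                       # (4) outputs
    \)
    (                             # (5) body
      (?: (?! \b function \b ) . )*
    )
""", re.VERBOSE | re.DOTALL)

# Hypothetical sample text, assumed here only to exercise the pattern.
sample = (
    "function f1(a, b) returns (x)\n"
    "begin\n  x = a and b;\nend\n"
    "function f2(c) returns (y, z)\n"
    "begin\n  y = c;\n  z = c;\nend\n"
)

# findall returns one 5-tuple per function.
results = FUNC_RE.findall(sample)
for keyword, name, inputs, outputs, body in results:
    print(name.strip(), "|", inputs, "|", outputs, "|", body.strip())
```

One caveat with this approach: group 5 stops at any standalone word function, so a body that itself contains that word (for example in a comment) would be cut short.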
I guess finditer() is a way to get better control of each set of 5 groups:
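import re

# target is the text to search, e.g. the contents of the file
matches = re.finditer(r"(?s)(\bfunction\b)(.*?)\((.*?)\)\s+returns\s+\((.*?)\)((?:(?!\bfunction\b).)*)", target)
for result in matches:
    g1 = result.group(1)  # 'function'
    g2 = result.group(2)  # name
    g3 = result.group(3)  # inputs
    g4 = result.group(4)  # outputs
    g5 = result.group(5)  # body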
Based on further information from the comments, I tested the following regex using the re.findall function in Python 3.6, which works with the example:
import re
file_contents = "function func1(in1 : bool; in2 : bool; in3 : bool) returns ( out : bool) var L1 : bool; L2 : bool; L5 : bool; L4 : bool; L3 : bool; begin L1 = L3 and L4; L2 = L1 or L5; out = L2; L5 = in3; L4 = in2; L3 = in1; end \n random code \nfunction func2(in1 : bool; in2 : bool; in3 : bool) returns ( out : bool) var L1 : bool; L2 : bool; L5 : bool; L4 : bool; L3 : bool; begin L1 = L3 and L4; L2 = L1 or L5; out = L2; L5 = in3; L4 = in2; L3 = in1;"
pattern = r"(function) (.*?)\((.*?)\) returns \((.*?)\) (.*)"
regex_results = re.findall( pattern, file_contents )
print( regex_results )
Output:
[('function', 'func1', 'in1 : bool; in2 : bool; in3 : bool', ' out : bool', 'var L1 : bool; L2 : bool; L5 : bool; L4 : bool; L3 : bool; begin L1 = L3 and L4; L2 = L1 or L5; out = L2; L5 = in3; L4 = in2; L3 = in1; end '), ('function', 'func2', 'in1 : bool; in2 : bool; in3 : bool', ' out : bool', 'var L1 : bool; L2 : bool; L5 : bool; L4 : bool; L3 : bool; begin L1 = L3 and L4; L2 = L1 or L5; out = L2; L5 = in3; L4 = in2; L3 = in1;')]
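Note that this pattern depends on each function sitting on a single line, because "." does not match a newline unless re.DOTALL is set. The sketch below uses a made-up multi-line input to show the difference. The second pattern is one possible variant and an assumption on my part, not part of the answer above: it makes the body lazy and ends it just before the next function keyword or at the end of the input.

```python
import re

# Pattern from the answer above: the final (.*) stops at the first newline.
pattern = r"(function) (.*?)\((.*?)\) returns \((.*?)\) (.*)"

# Hypothetical input where each body is spread over several lines.
multi_line = (
    "function func1(in1 : bool) returns (out : bool) var L1 : bool;\n"
    "begin\n"
    "  out = in1;\n"
    "end\n"
    "function func2(in1 : bool) returns (out : bool) var L2 : bool;\n"
    "begin\n"
    "  out = in1;\n"
    "end\n"
)

# Only the first line of each body is captured.
single_line_bodies = [m[4] for m in re.findall(pattern, multi_line)]

# Assumed variant: DOTALL plus a lazy body that ends at a lookahead for the
# next "function" keyword or the end of the string.
dotall_pattern = (
    r"(function) (.*?)\((.*?)\) returns \((.*?)\) (.*?)(?=\bfunction\b|\Z)"
)
full_bodies = [
    m[4] for m in re.findall(dotall_pattern, multi_line, flags=re.DOTALL)
]

print(single_line_bodies)
print(full_bodies)
```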
I figured out a different way to accomplish what I was trying to do.
I used the following line:
re.split('(function )(.*?)\\((.*?)\\) returns \\((.*?)\\)', contents)
This will split up what I wanted into a list. I then chunk the list and assign it to the variables I have.
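For anyone following along, here is a minimal sketch of what the chunking step might look like. When the pattern contains capture groups, re.split also returns the captured text, so after the first element the list repeats in groups of five. The sample text and the names fields and functions are assumptions for illustration only.

```python
import re

# Hypothetical sample text, made up for illustration.
contents = (
    "header text\n"
    "function f1(a, b) returns (x) begin x = a and b; end\n"
    "function f2(c) returns (y) begin y = c; end\n"
)

parts = re.split(r"(function )(.*?)\((.*?)\) returns \((.*?)\)", contents)

# parts[0] is whatever precedes the first function header. After that the
# list repeats as: keyword, name, inputs, outputs, body.
fields = parts[1:]

functions = []
for i in range(0, len(fields), 5):
    keyword, name, inputs, outputs, body = fields[i:i + 5]
    functions.append(
        {"name": name, "inputs": inputs, "outputs": outputs, "body": body.strip()}
    )

for func in functions:
    print(func)
```

With this approach, any text between the end of one function and the next function header ends up in the preceding body, so it may need trimming afterwards.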
Thanks to everyone who took the time to answer.